

Omnivorous MT: Building NLP Systems for Multiple Languages

Ariadna Font Llitjós 1, Christian Monson 1, Roberto Aranovich 2, Lori Levin 1, Ralf Brown 1, Katharina Probst 3, Erik Peterson 1, Jaime Carbonell 1, Alon Lavie 1

1 Language Technologies Institute, School of Computer Science, Carnegie Mellon University

2 Department of Linguistics, University of Pittsburgh

3 Wherever Katharina is


Abstract

By adopting a "first-things-first" approach, we overcome a number of challenges inherent in developing NLP systems for resource-scarce languages. By first gathering the necessary corpora and lexicons, we are then able to build, for Mapudungun, a spelling corrector, a morphological analyzer, and two Mapudungun-Spanish machine translation systems; and for Quechua, a morphological analyzer as well as a rule-based Quechua-Spanish machine translation system. We have also worked on languages (Hebrew, Hindi, Dutch, and Chinese) for which resources such as lexicons and morphological analyzers were available. The flexibility of the AVENUE project architecture allows us to take a different approach as needed in each case. This paper describes the Mapudungun and Quechua systems.

1. Omnivorous MT

In the natural language processing food chain, machine translation (MT) sits at the top. Fluent translation consumes lexical, morphological, syntactic, semantic, and pragmatic knowledge for each language in the translation pair, whether these pieces of knowledge come from hand-coded translation rules, from co-occurrence statistics in large corpora, or by some other means.

Most work in natural language processing (NLP) focuses on a small set of languages. Consequently, an assortment of basic language processing resources and tools exists for languages from this small set. And with the lower level language tools available for consumption, MT systems among these languages can be constructed. The same breadth of resources is pointedly lacking for the large number of languages that have not made it into this elite set. This scarcity of language processing resources means fewer consumables for MT systems over these languages.

The ecosystem of MT contains many competing MT species. Each species of MT requires different nourishment to thrive: a statistical MT system requires large parallel corpora; a rule-based MT system requires many human hours of rule writing and lexicon building; etc. The approach toward MT that the AVENUE project has adopted is to survey the language processing landscape, noting what lower level language processing resources are available for a particular language pair, and then build the MT system appropriate for the available resources. Hence the overarching MT system is 'omnivorous', ingesting whatever resources are available to produce the highest quality MT possible given the resources.

1.1 The AVENUE Project

The omnivorous AVENUE project at CMU encompasses four distinct approaches to MT: statistical MT (SMT), example-based MT (EBMT), hand-written syntactic transfer rule-based MT (RBMT), and a newly developed approach that automatically learns syntactic transfer rules. This paper discusses our selective application of these four MT approaches to produce MT systems for four language pairs: Mapudungun-Spanish, Quechua-Spanish, Hebrew-English, and Hindi-English. Each of these four language pairs consists of one language for which many natural language processing resources exist (Spanish or English) and a second language for which fewer resources exist (Mapudungun, Quechua, Hebrew, or Hindi). This intentionally lopsided distribution of language resources between the source and target languages allows us to leverage the knowledge available for the resource-rich language into better translations of the resource-poor language. For additional details on why we chose to translate from the resource-poor to the resource-rich language, see HINDI PAPER.

1.2 Required Resources

Table 1 shows the resources that each of the four AVENUE project MT approaches needs. Because the AVENUE project assumes resources are available for the major language in each translation pair,


and because the major language is the target translation language, Table 1 focuses on resources that are required for a low-resource source language.

All methods of MT require resources of both human and computational knowledge. The most basic human resource that all four of our approaches to MT need is the availability of computational linguists familiar with building MT systems. Building an MT system requires competency with computers and a working knowledge of the common techniques used in natural language processing. Frequently, however, a computational linguist working on MT for a particular language pair does not speak or have detailed knowledge of both, or either, language in the pair. Of course, a more intimate knowledge of the languages of translation is generally helpful in building an MT system. Of the four MT creation strategies used within the AVENUE project, a computational linguist with language specific knowledge is particularly important to the hand-written RBMT approach: it is difficult for a person to write transfer rules for a language they do not speak or have not studied. Another human resource that can be tapped for building MT systems is the knowledge that linguistically naive, non-expert bilingual speakers possess. As discussed later in this section, the automatic rule learning approach to MT development, which the AVENUE project has pioneered, harnesses the expertise of bilingual speakers.

Aside from resources of human knowledge and labor, MT systems require computational resources. Computational resources fall into at least two categories. First, there are computational language processing resources, such as morphological analysis of a source language. Both the EBMT and the SMT systems within the AVENUE project can use, and do benefit from, shallow morphological analysis of the source sentence, where complex surface word forms are segmented into individual morphemes. And a deep morphological analysis, which processes a source word into a stem and a set of morphosyntactic feature-value pairs, is required for our rule-based MT formalism. The syntactic transfer rules, whether written by hand or automatically learned, then reference the specific features and values of individual words and morphemes.
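To make the two depths of analysis concrete, the following minimal Python sketch contrasts them for the verb form pe-fi-ñ ('I saw him/her/them/it'), which reappears as example (1) in section 2.2.2.4.1. The feature names and dictionary layout are our own illustration, not the project's actual representation.

# Sketch only: feature names and structures are illustrative.

# Shallow analysis: the surface word is merely segmented into morphemes.
shallow = ["pe", "-fi", "-ñ"]          # pe-fi-ñ 'I saw him/her/them/it'

# Deep analysis: a stem plus morphosyntactic feature-value pairs that
# transfer rules can reference directly.
deep = {
    "stem": "pe",                      # 'see'
    "pos": "V",
    "suffixes": [
        {"form": "-fi", "object-person": 3},
        {"form": "-ñ", "subject-person": 1, "subject-number": "sg",
         "mood": "indicative"},
    ],
}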

Data resources form a second category of computational resources needed for MT system development. Natural language data comes in many different forms. The simplest data needed by all the AVENUE MT systems is a translation lexicon, which at a minimum lists word and/or stem level translations between the source and the target languages. Augmenting a simple lexicon with probability estimates for each translation of each entry benefits all four of our MT approaches. And just as with morphological analysis, if the lexicon contains feature-value pairs, the rule-based MT methods may reference them. Additionally, the RBMT methods require word level part-of-speech information in the lexicon.

While a lexicon itself, possibly hand built, is a sufficient data resource from which to produce a hand-written RBMT system, each of the other three AVENUE MT methods relies more heavily on automatic algorithms to describe correct translations and hence needs additional data resources. The corpus-based MT approaches, SMT and EBMT, both feed off sentence-aligned bilingual corpora, from which they glean a variety of translation statistics. Bilingual corpora can also be a source of word translation probability estimates that the rule-based MT systems can use. And finally, the AVENUE approach to learning syntactic transfer rules requires a unique data resource: a specially designed elicitation corpus. An AVENUE elicitation corpus consists of a set of phrases and sentences in the resource-rich language. A bilingual speaker uses a graphical tool to both translate the elicitation corpus and then manually word-align the translations. The phrases and sentences in the elicitation corpus are chosen to form minimal and near-minimal pairs. The rule learning algorithm then compares and contrasts the translations and word alignments of portions of the elicitation corpus which it knows are similar, generalizing from specific translations and word alignments to high-level syntactic transfer rules used in translating unseen text (Probst, 2005).

1.3 Available Resources Lead to Specific MT Solutions

Section X of this paper discusses the MT systems the AVENUE team built for each of our four language pairs in turn, giving specific details on the available resources and the systems that were built. Here we present a brief side-by-side comparison of the resource constraints we worked under for each of the four resource-poor languages and how these constraints led us to build particular MT systems.

(Human resources: a computational linguist (CL) with knowledge of MT, a CL with knowledge of the language pair, and a bilingual speaker. Computational resources: processing (morphological analysis) and data (lexicon, parallel corpus, elicitation corpus).)

| MT approach                  | CL: MT | CL: language pair | Bilingual speaker | Morphological analysis | Lexicon | Parallel corpus | Elicitation corpus |
|------------------------------|--------|-------------------|-------------------|------------------------|---------|-----------------|--------------------|
| Example-Based (corpus-based) | X      | If available      |                   | If available           | X       | X               |                    |
| Statistical (corpus-based)   | X      | If available      |                   | If available           | X       | X               |                    |
| Hand-written (rule-based)    | X      | X                 |                   | X                      | X       | If available    |                    |
| Learned (rule-based)         | X      | If available      | X                 | X                      | X       | If available    | X                  |

Table 1: Any approach to building machine translation (MT) systems requires resources. This table presents and contrasts the resources required by the four AVENUE MT creation strategies. Each 'X' represents a resource (column) that is required for a particular approach to MT (row).


Table 2 summarizes the available resources for each low-resource language. At a glance, with the possible exception of Hindi, the AVENUE project had better access to human resources than to computational resources.

The first column in Table 2, which notes the availability of computational linguists with knowledge of machine translation techniques, is a prerequisite to developing MT systems for any language. The third column in Table 2 is also fully filled in, because the AVENUE project had access to bilingual speakers for each of our translation pairs.

On the other hand, for any particular resource-poor language there are few computational linguists with detailed knowledge of that language. The AVENUE project elected to build a Hebrew MT system specifically because of project-internal expertise on Hebrew. Similarly, we chose to work with the South American indigenous languages Mapudungun and Quechua only after it became clear we would have some language-specific expertise on each. The AVENUE project specifically cultivated relationships with government, civic, linguistic, and educational organizations of the native speaking populations of several Native American languages. It is through these contacts that the AVENUE project gained computational linguistic expertise in Mapudungun and Quechua. And after the AVENUE project had already begun work on Mapudungun, a computational linguist with field work experience in Mapudungun joined our team. Similarly, another member of the AVENUE project visited Cuzco, Peru, for three months to learn Quechua. Unlike Hebrew, Mapudungun, and Quechua, the AVENUE team did not elect to work on Hindi. The task of building an MT system for Hindi was presented to us as the surprise language portion of the DARPA XX project. Hence, no computational linguist on our team speaks Hindi. Fortunately, we had access to fairly significant computational resources for Hindi.

As noted, computational resources for low-resource languages are more difficult to come by than for resource-rich languages. Only Hindi had either an easily available, ready-made morphological analyzer or a significant amount of parallel sentence-aligned data. For Hebrew we were able to adapt a freely available morphological analyzer, while for Mapudungun and Quechua we built morphological analysis tools from scratch. To our knowledge, there was no previously existing Mapudungun morphological analyzer; hence we were forced to build our own. And with the effort we had put into building a morphological analyzer for Mapudungun, together with our acquired human knowledge resources, we were able to quickly create a basic morphological analyzer for Quechua.

None of the four low-resource languages had lexicons adequate for machine translation. Sections X describe in detail the creation and adaptation of lexical resources for each individual language. For two of the languages, Hindi and Mapudungun, the AVENUE project had access to reasonably sized sentence-aligned parallel corpora. Through participation in the DARPA surprise language exercise on Hindi, the AVENUE project gained access to a large sentence-aligned Hindi-English parallel corpus. In contrast, a significant effort within the AVENUE project, as detailed in section X, was devoted to creating, transcribing, and translating a bilingual Mapudungun-Spanish speech corpus. And rounding out the computational resources, the elicitation corpus is a unique data resource in that it is created in the resource-rich language and then translated by bilingual speakers into the resource-poor language. Because of the emphasis within the AVENUE project on collaboration with native speakers, the elicitation corpus was available as a data resource in each of the four languages.

| Language   | CL: MT | CL: language pair            | Bilingual speaker | Morphological analysis | Lexicon         | Parallel corpus | Elicitation corpus |
|------------|--------|------------------------------|-------------------|------------------------|-----------------|-----------------|--------------------|
| Mapudungun | X      | X                            | X                 | Created                | Created         | Created         | X                  |
| Quechua    | X      | Learned                      | X                 | Created                | Created         |                 | X                  |
| Hebrew     | X      | X                            | X                 | Adapted                | Adapted         |                 | X                  |
| Hindi      | X      | External consultant, Learned | X                 | X                      | Created/Adapted | X               | X                  |

Table 2: Different languages with different available resources. This table presents and contrasts the relevant available resources for the four low-resource languages discussed in this paper.


Table 3 lists the MT systems the AVENUE project developed for each of the four low-resource languages. Within the AVENUE project, the available resources dictate which MT approaches are feasible for a given language pair. Corpus-based approaches to MT were infeasible for the Hebrew to English and the Quechua to Spanish language pairs because we lacked sentence-aligned bilingual corpora between these languages. Similarly, because the AVENUE team lacked first-hand knowledge of Hindi, only a small hand-written rule-based MT system was developed for Hindi to English MT. In addition to resource availability constraints, time constraints have also affected which MT systems we built for each language. Of the two corpus-based approaches to MT, we have only applied example-based MT to the task of Mapudungun to Spanish translation. Similarly, we have not yet employed the rule-learning approach to build MT systems for Mapudungun to Spanish or for Quechua to Spanish. Applying the AVENUE-developed automatic syntactic rule learning method to Mapudungun and Quechua is high on our list of next steps.

2. Four MT Systems

Over the past six years, the AVENUE project at the Language Technologies Institute at Carnegie Mellon University has worked to produce basic MT systems for a number of resource-poor languages. Chief among our accomplishments is the construction of MT systems for four languages for which few or no MT systems have previously been built: Mapudungun, Quechua, Hebrew, and Hindi. Mapudungun is the smallest of these four languages: fewer than 1 million people speak Mapudungun in southern Chile and Argentina. Quechua, a second indigenous language of South America, is more widely spoken; there are approximately 10 million speakers of Quechua in Peru, Bolivia, Ecuador, southern Colombia, and northern Argentina. Hebrew is spoken by approximately 7 million people in Israel, and Hindi is the largest of the four languages for which we built MT systems, with several hundred million native and secondary speakers in India and abroad. Despite the large number of Hindi speakers, Hindi is still resource-poor, with only rudimentary natural language processing data and tools freely obtainable.

Building MT systems for resource-poor languages is challenging. Section 1.3 contrasted the availability (and unavailability) of resources for each of our four resource-poor languages. In sections X-X we consider each language in turn, specifically illustrating how an omnivorous approach to MT allowed the AVENUE team to overcome the specific resource gaps of each language to build MT systems. Finally, in Section X, we synthesize the lessons and experience we gained in building these machine translation systems and present overall suggestions for others looking to build MT for languages with scarce computational resources.

2.1 Two Indigenous South American Languages

The omnivorous approach to MT within the AVENUE project was originally developed to tackle challenges we faced in developing MT for indigenous South American languages under an NSF grant. Omnivorous MT required close collaboration with native speaking populations. The AVENUE project relies on native bilingual speakers to provide crucial information toward building MT systems. After considering several candidate languages, the AVENUE team elected to work with Mapudungun and Quechua, as they promised the closest collaboration with native language organizations. The details of the collaborations with the native organizations are presented in sections X and X.

Even with a strong working relationship with native speakers of Mapudungun and Quechua, the AVENUE team faced serious hurdles to creating functional MT for these languages. At the time the AVENUE team started working on Mapudungun, even simple natural language tools such as morphological analyzers or spelling correctors did not exist. In fact, there were few electronic resources from which such natural language tools might be built. There were no standard Mapudungun text or speech corpora or lexicons, and no parsed treebanks. The text that did exist was in a variety of competing orthographic formats. More resources exist for Quechua, but they are still far from what is needed to build a complete MT system.

Mapudungun and Quechua share challenges for MT creation beyond simple resource scarcity, however. These additional challenges stem from both Mapudungun and Quechua being indigenous South American languages.

| Language   | Example-Based (corpus) | Statistical (corpus) | Hand-written (rule) | Learned (rule) |
|------------|------------------------|----------------------|---------------------|----------------|
| Mapudungun | X                      |                      | X                   |                |
| Quechua    |                        |                      | X                   |                |
| Hebrew     |                        |                      | X                   | X              |
| Hindi      | X                      | X                    | X (small)           | X              |

Table 3: Languages vs. MT systems. Depending on the available resources, the AVENUE project built a variety of MT systems for each language.


First, both Mapudungun and Quechua pose unique challenges from a linguistic theory perspective. Although Mapudungun and Quechua are unrelated languages, they both have complex agglutinative morphological structures. In addition, Mapudungun is polysynthetic, incorporating objects into the verb of a sentence. Agglutination and polysynthesis are both linguistic properties that the majority of languages for which natural language resources have been built do not possess. Second, human factors also pose a particular challenge for these two languages. Specifically, there is a scarcity of people trained in computational linguistics who are native speakers of, or who have a good knowledge of, these indigenous languages. And finally, an often overlooked challenge that confronts development of NLP tools for resource-scarce languages is the divergence of the culture of the native speaker population from the western culture of computational linguistics.

Despite the challenges facing the development of natural language processing systems for Mapudungun and Quechua, the AVENUE team has developed a suite of basic language resources for each of these languages and has then leveraged these resources into more sophisticated natural language processing tools. The AVENUE project led a collaborative group of Mapudungun speakers in first collecting and then transcribing and translating into Spanish by far the largest spoken corpus of Mapudungun available. From this corpus we then built a spelling checker and a morphological analyzer for Mapudungun. For Quechua, we have created parallel and aligned data as well as a basic bilingual lexicon with morphological information. And with these resources we have built two prototype machine translation (MT) systems for Mapudungun and one prototype MT system for Quechua. The following sections detail the construction of these resources, focusing on overcoming the specific challenges Mapudungun and Quechua each present as resource-scarce languages.

2.2 Mapudungun

Since May of 2000, in an effort to ultimately produce Mapudungun-Spanish machine translation, computational linguists at CMU's Language Technologies Institute have collaborated with Mapudungun language experts at the Instituto de Estudios Indígenas (IEI, Institute for Indigenous Studies) at the Universidad de la Frontera (UFRO) in Chile, and with the Bilingual and Multicultural Education Program of the Ministry of Education (Mineduc) in Chile (Levin et al., 2000).

From the very outset of our collaboration, we battled with the scarcity of electronic resources for Mapudungun. The first phase of the AVENUE collaboration was to collect and produce parallel Mapudungun-Spanish language data from which higher-level language processing tools and systems could be built. One barrier we faced in the collection of Mapudungun language data is that there are currently several competing orthographic conventions for written Mapudungun. Early in the AVENUE Mapudungun collaboration, the IEI-UFRO team established a set of orthographic conventions. All the data collected under the AVENUE project conforms to this set of orthographic conventions; if the Mapudungun data was originally in some other orthographic convention, then we manually converted it into our own orthography. Recently, however, a different orthography, Azümchefi, has been chosen by the Chilean government for official documents. Portions of our data have been automatically converted into Azümchefi using automatic substitution rules.
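As a rough sketch of what such substitution-rule conversion looks like, the following Python fragment applies an ordered list of grapheme replacements. The particular mappings shown are placeholders: the actual correspondences between the IEI-UFRO orthography and Azümchefi are not given in this paper.

import re

# Hypothetical grapheme correspondences; order matters, since earlier
# rules can feed or bleed later ones.
SUBSTITUTIONS = [
    (r"tx", "tr"),   # placeholder mapping
    (r"ng", "g"),    # placeholder mapping
]

def convert_orthography(text: str) -> str:
    """Re-spell Mapudungun text by applying ordered substitution rules."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text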

2.2.1 Corpora and Lexicons

Recognizing the appetite of MT systems for data resources and the scarcity of Mapudungun language data, the AVENUE team began collecting Mapudungun data soon after our collaboration began with IEI and UFRO in 2000. Directed from CMU and conducted by native speakers of Mapudungun at UFRO in Temuco, Chile, our data collection efforts ultimately resulted in three separate corpora: 1) a small parallel Mapudungun-Spanish corpus of historical texts and newspaper text; 2) 1,700 Spanish sentences that were manually translated into Mapudungun and aligned (these sentences were carefully crafted to elicit specific syntactic phenomena in Mapudungun); and 3) a parallel corpus consisting of 170 hours of transcribed and translated Mapudungun speech. Natural language translation is challenging even for humans, particularly when the source text to translate is spoken language, with its many disfluencies and colloquial expressions. The translation of the speech corpus was completed by bilingual speakers of Mapudungun and Spanish who attempted to translate as literally as possible, but who were not professional translators. Furthermore, each section of the speech corpus was only ever viewed by a single translator. All of these translation challenges combine to render the Spanish translation of the Mapudungun text imperfect. Still, a bilingual corpus is a valuable resource. Additional details on the corpora are described in Monson et al. (2004).

While small, the second corpus listed above, the structural elicitation corpus, has aided our development of the hand-built rule-based MT system for Mapudungun (see section XX). The larger spoken corpus was put to two specific uses to develop MT for Mapudungun. First, the transcribed and translated spoken corpus was used as a parallel corpus to develop an Example-Based MT system, described further in section XX. Second, the AVENUE team used the spoken corpus to derive a bilingual stem lexicon for use in Rule-Based MT systems.

To obtain the bilingual stem lexicon, the AVENUE team first converted the transcribed spoken corpus text into a monolingual Mapudungun word list. The 117,003 most frequent fully-inflected Mapudungun word forms were hand-checked for spelling according to the adopted orthographic conventions for Mapudungun. Then 15,120 of the most frequent spelling-corrected, fully-inflected word forms were hand-segmented into two parts: the first part consisting of the stem of the word, and the second part consisting of one or more suffixes. This produced 5,234 stems and 1,303 suffix groups. These 5,234 stems were then translated into Spanish.

2.2.1.1 Mapudungun Morphological Analyzer

In addition to consuming data resources, the MT systems used in the AVENUE project consume the computational resource of morphological analysis. Mapudungun is an agglutinative and polysynthetic language. Figure 2 contains glosses of a few morphologically complex Mapudungun verbs that occur in our corpus of spoken Mapudungun. Monson et al. (2004) document the large number of occurring Mapudungun word forms in this speech corpus. A typical complex verb form in our spoken corpus


consists of five or six morphemes. Agglutinative morphological processes explode the number of possible surface word forms while simultaneously causing individual surface word forms to be rare. Furthermore, each morpheme may alone contain several morphosyntactic features. Hence, it is difficult for an MT system to translate Mapudungun words directly into another language. Analyzing the morphology of Mapudungun allows our MT systems to cover a larger fraction of all Mapudungun surface word forms. And by identifying each morpheme in each Mapudungun word and assigning a meaning to each identified morpheme, a rule-based MT system can translate each piece of meaning individually.

We had found few data resources available for Mapudungun, so we were not surprised to learn that no morphological analyzer for Mapudungun had ever been written. Necessarily, then, we developed our own morphological analyzer. Using the stem lexicon we had created and reference grammars of Mapudungun, our morphological analyzer takes a Mapudungun word as input and produces as output all possible morphological analyses of the word. Each analysis specifies: 1) a single stem in that word; 2) each suffix in that word; and 3) a syntactic analysis for the stem and each identified suffix.

To identify the morphemes (stem and suffixes) in a Mapudungun word, a lexicon of stems works together with a fairly complete lexicon of Mapudungun suffixes. The current version of the stem lexicon contains XXX stems and their Spanish translations. Each entry in this lexicon lists the part of speech of the stem as well as other features associated with the stem, such as lexical aspect in the case of verb stems. The suffix lexicon, built by hand by computational linguists on the AVENUE team, contains all the uncontroversial inflectional morphemes of Mapudungun. There are XXX Mapudungun suffixes in the suffix lexicon. Each suffix lists the part of speech that the suffix attaches to, as well as the linguistic features, such as person, number, mood, etc., that it marks. The morphological analyzer performs a recursive and exhaustive search over all possible segmentations of a given Mapudungun word. The analyzer starts from the beginning of the word and identifies and removes each stem that is an initial string in that word. (The current morphological analyzer does not handle noun incorporation, a fairly rare phenomenon in Mapudungun.) The analyzer then examines the remaining string, looking for a valid combination of suffixes that could complete the word. A suffix combination is valid if 1) each of the suffixes in the combination can attach to stems with the part of speech of the current stem, and 2) the suffixes occur in a grammatical order.

Mapudungun morphology is relatively unambiguous. Most Mapudungun words for which the stem is in the stem lexicon receive a single analysis. A few truly ambiguous suffix combinations may cause a Mapudungun word to receive more than one distinct analysis.

Once the morphological analyzer has found all possible segmentations of a word, it combines the feature information from the stem and the suffixes encountered in the analyzed word to create the syntactic analysis that is returned. For an example, see Figure 3.
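The following Python sketch mimics the analyzer's recursive, exhaustive search on toy lexicons. The entries are invented, and the grammatical-ordering check on suffix sequences (condition 2 above) is omitted for brevity.

STEMS = {"pe": {"pos": "V"}}                 # toy stem lexicon: stem -> features
SUFFIXES = {"fi": {"attaches_to": "V"},      # toy suffix lexicon; real entries
            "ñ": {"attaches_to": "V"}}       # also carry person, mood, etc.

def suffix_combinations(rest, pos):
    """Recursively enumerate suffix sequences that cover `rest` exactly."""
    if rest == "":
        yield []
        return
    for j in range(1, len(rest) + 1):
        suffix = rest[:j]
        if suffix in SUFFIXES and SUFFIXES[suffix]["attaches_to"] == pos:
            for tail in suffix_combinations(rest[j:], pos):
                yield [suffix] + tail

def analyze(word):
    """Return every (stem, suffix list) segmentation of `word`."""
    analyses = []
    for i in range(1, len(word) + 1):        # strip each stem that is an
        stem = word[:i]                      # initial string of the word
        if stem in STEMS:
            pos = STEMS[stem]["pos"]
            for suffixes in suffix_combinations(word[i:], pos):
                analyses.append((stem, suffixes))
    return analyses

print(analyze("pefiñ"))  # [('pe', ['fi', 'ñ'])]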

2.2.2 Two Machine Translation Systems

With the basic data and computational resources built for Mapudungun, the AVENUE team could proceed toward constructing machine translation systems from Mapudungun into Spanish. We have developed two prototype Mapudungun-Spanish MT systems. The first MT system follows a data-driven paradigm called Example-Based MT. An EBMT system can exploit the decent-sized spoken corpus of Mapudungun as a source of data to produce translations with higher lexical coverage but lower translation accuracy. The second MT system is a hand-built Rule-Based MT system. Building an MT system by hand is feasible only because the AVENUE team has computational linguists available as a resource. The human-developed Rule-Based MT system complements the EBMT system with high quality translations of specific syntactic constructions over a smaller vocabulary.

2.2.2.1 An Example-Based Machine Translation System

Example-Based Machine Translation (EBMT) relies on previous translations performed by humans to create new translations without the need for human translators. The previous translations are called the training corpus. For the best translation quality, the training corpus should be as large as possible and as similar to the text to be translated as possible. When the exact sentence to be translated occurs in the training material, the translation quality is human-level, because the previous translation is re-used. As the source sentence differs more and more from the training material, quality decreases, because smaller and smaller fragments must be combined to produce the translation, increasing the chances of an incorrect translation.

As the amount of training material decreases, so does the translation quality, because there are fewer long matches between the training texts and the input to be translated. Conversely, more training data can be added at any time, improving the system's performance by allowing more and longer matches. EBMT usually finds only partial matches, which generate lower-quality translations. Further, EBMT searches for phrases of two or more words, and thus there can be portions of the input which do not produce phrasal translations. For unmatched portions of the input, EBMT falls back on a probabilistic lexicon trained from the corpus to produce word-for-word translations. This fall-back approach provides some translation (even though of lower quality) for any word form encountered in the training data. While the translation quality of an EBMT system can be human-level when long phrases or entire sentences are matched, any mistakes in the human translations used for training (spelling errors, omissions, mistranslations) will become visible in the EBMT system's output. Thus it is important that the training data be as accurate as possible. The training corpus we use for EBMT is the spoken language corpus described earlier. As discussed in section 2.2.1, the corpus of spoken Mapudungun contains some errors and awkward translations.
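To illustrate the matching-plus-fallback behavior just described, here is a toy Python sketch. The phrase table and word lexicon entries are invented stand-ins for statistics gleaned from a real training corpus.

PHRASE_TABLE = {("feyti", "ruka"): "la casa"}      # toy phrase pairs
WORD_LEXICON = {"ruka": "casa", "che": "gente"}    # toy most-probable words

def translate(tokens):
    """Longest phrasal match first; word-for-word fallback otherwise."""
    output, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):        # longest match first
            phrase = tuple(tokens[i:j])
            if phrase in PHRASE_TABLE and j - i >= 2:  # phrases of 2+ words
                output.append(PHRASE_TABLE[phrase])
                i = j
                break
        else:
            # No phrasal match: fall back on the probabilistic lexicon
            # (lower quality, but covers any word seen in training).
            output.append(WORD_LEXICON.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(output)

print(translate(["feyti", "ruka", "che"]))  # "la casa gente"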


Highly agglutinative languages like Mapudungun pose a challenge for Example-Based MT. Because there are so many inflected versions of each stem, most inflected words are rare. If a rare word does not occur in the corpus at all, it will not be translatable by EBMT; if it occurs only a few times, it will also be hard for EBMT to gather accurate statistics about how it is used. Additionally, automatic word-level alignment between the two languages, which is used to determine the appropriate translation of a partial match, becomes more difficult when individual words in one language correspond to many words in the other language. We address both issues by using the morphological analyzer for Mapudungun to split words into individual stems and suffixes. Alone, each stem and suffix is more common than the original combination of stem and suffixes, and the separate morphemes are more likely to map to single words in Spanish.
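A sketch of this preprocessing step, assuming an analyze function like the one sketched in section 2.2.1.1 that returns (stem, suffixes) segmentations:

def split_words(sentence, analyze):
    """Replace each word by its stem and suffixes as separate tokens."""
    tokens = []
    for word in sentence.split():
        analyses = analyze(word)
        if analyses:
            stem, suffixes = analyses[0]   # Mapudungun is rarely ambiguous
            tokens.append(stem)
            tokens.extend("-" + s for s in suffixes)
        else:
            tokens.append(word)            # unanalyzable words pass through
    return tokens

# e.g. split_words("pefiñ", analyze) -> ["pe", "-fi", "-ñ"]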

We currently have a prototype EBMT system trained on approximately 204,000 sentence pairs from the corpus of spoken Mapudungun, containing a total of 1.25 million Mapudungun tokens after splitting the original words into stems and one or more suffixes. This separation increased BLEU scores (Papineni et al., 2001) on a held-out portion of the speech corpus by 5.48%, from 0.1530 to 0.1614. We expect further improvements in MT accuracy from more detailed use of morphological analysis, from the inclusion of common phrases in the corpus, and from repairing translation errors and awkward translations in the corpus.

2.2.2.2 A Rule-Based MT System

The second MT system the AVENUE team has developed is a Rule-Based system where the rules have been built by hand. Manually developing a Rule-Based MT (RBMT) system involves detailed comparative analysis of the grammars of the source and target languages. Manually imbuing an MT system with detailed syntactic knowledge can produce high quality translation, but typically takes a significant amount of time to implement. Despite the time requirements, developing a Rule-Based MT system for Mapudungun by hand is a natural fit in our case for two reasons. First, the AVENUE project has the necessary human expertise available to hand-develop an MT system with significant syntactic understanding. And second, hand-built rule-based MT nicely complements the weaknesses of EBMT: EBMT focuses on high translation coverage at the expense of accuracy; in contrast, the rule-based MT system does not exhaustively cover Mapudungun grammar, but the rules we wrote cover a portion of Mapudungun with high accuracy.

At a high level, the AVENUE Rule-Based MT framework is composed of three modules: a source language morphological analyzer, the syntactic rule-based transfer system itself, and a target language morphological generator. The source language morphological analyzer processes the raw input source sentence into a lattice of all possible morphological segmentations. The transfer system maps the morphologically analyzed source sentences into target language expressions via a transfer lexicon and grammar. The transfer grammar and lexicon contain syntactic and lexical rules that explicitly refer to morphosyntactic features of the two languages (more details are given in the next subsections). The output of the transfer system is a target language expression composed of uninflected words bundled with grammatical features. It is then the morphological generator's job to massage the uninflected word-feature bundles into fully inflected target language output sentences.

In the Mapudungun-Spanish Rule-Based MT system, the source language morphological analyzer is the Mapudungun morphological analyzer described earlier in section 2.2.1.1. The Mapudungun to Spanish transfer grammar and lexicon are described in section X, while the Spanish morphological generator is described in section X. But before delving into details of the Mapudungun-Spanish rule-based MT system, a short overview of the AVENUE rule-based MT framework is needed.

2.2.2.3 Overview of Transfer Rule Syntax and the Run-time Transfer System

Figure 4: A simplified hand-written AVENUE-style transfer rule. This rule transfers Mapudungun nominal expressions that mark for number into Spanish nominal expressions. Where Spanish marks number morphologically on the head noun, Mapudungun uses a separate particle to mark number on nouns.

{NBar,1}                        ; 1) Rule identifier
NBar::NBar : [PART N] -> [N]    ; 2) SL and TL constituent structure specification
(
 (X2::Y1)                       ; 3) SL (X2) to TL (Y1) constituent alignment
 ((X1 number) =c pl)            ; 4) SL constraint: rule applies only if SL number is pl
 ((X0 number) = (X1 number))    ; 4) SL constraint: SL number is passed up to the SL NBar
 ((Y0 number) = (Y1 number))    ; 5) TL constraint: TL number is passed up to the TL NBar
 ((Y0 gender) = (Y1 gender))    ; 5) TL constraint: TL gender is passed up to the TL NBar
 ((Y0 number) = (X0 number))    ; 6) Transfer equation: the number on the TL NBar
)                               ;    must equal the number on the SL NBar


This paper does not describe in detail the rationale behind either the transfer rules used in the AVENUE Rule-Based MT system or the run-time MT transfer engine used to interpret them. This section, however, does provide a basic understanding of syntactic rules and their interpretation within Rule-Based MT in the AVENUE framework. For more specifics on the AVENUE Rule-Based MT formalism and run-time transfer engine, please see Peterson (2002) and Probst (2005).

2.2.2.3.1 The AVENUE Transfer Rule Formalism

In Rule-Based MT, transfer rules decompose the grammatical information contained in a source language expression into a set of grammatical properties, such as number, person, tense, lexical meaning, etc. Then each particular rule builds an equivalent target language expression, copying, modifying, or rearranging grammatical values according to the requirements of the target language syntax and target language lexical repertoire. In the AVENUE system, translation rules have six components (see Figure 4): 1) a rule identifier, which consists of a constituent type (Sentence, Noun Phrase, Verb Phrase, etc.) and a unique ID number; 2) a constituent structure specification for both the source language (SL) and the target language (TL); 3) alignments between the SL constituents and the TL constituents; 4) SL constraints, which provide information about features and their values in the SL sentence; 5) TL constraints, which provide information about features and their values in the TL sentence; and 6) transfer equations, which provide information about which feature values transfer from the source into the target language.

Figure 4 illustrates these six components using a portion of a rule intended for a Mapudungun to Spanish rule-based transfer system. In Mapudungun, plurality in nouns is marked in some cases by the prenominal particle pu. The transfer rule in Figure 4 transfers the plural feature from pu in Mapudungun onto Spanish nouns. In this transfer grammar, NBar is the constituent that dominates the noun and its modifiers, but not its determiners. According to this rule, the Mapudungun sequence PART N will be transferred into a noun in Spanish. The constituent alignment aligns the Mapudungun noun (X2) with the Spanish noun (Y1). The SL is checked to ensure the application of the NBar,1 rule only when the Mapudungun particle specifies plural number; if the Mapudungun noun is preceded by any other particle, the rule fails to apply. The number feature is passed up from the particle to the Mapudungun NBar. At the same time, the number that is marked on the Spanish noun is passed up to the Spanish NBar. The gender feature, present only in Spanish, is passed up from the Spanish noun to the Spanish NBar. And finally, a transfer equation ensures that the number on the Spanish NBar equals the number on the Mapudungun NBar. This process is represented graphically by the tree structure shown in Figure 5.
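Read procedurally, the rule's constraints behave roughly as in the Python rendering below, which uses plain dictionaries for feature structures and simple assignment in place of true unification. It is a schematic of Figure 4, not of the actual transfer engine.

# Toy feature structures for the rule's nodes.
x1 = {"number": "pl"}        # X1: the Mapudungun particle "pu"
y1 = {"gender": "masc"}      # Y1: the uninflected Spanish noun from the lexicon
x0, y0 = {}, {}              # X0/Y0: the SL and TL NBar nodes being built

assert x1.get("number") == "pl"   # ((X1 number) =c pl): rule fires only for plural
x0["number"] = x1["number"]       # ((X0 number) = (X1 number)): pass number up (SL)
if "number" in y1:                # ((Y0 number) = (Y1 number)): pass number up (TL)
    y0["number"] = y1["number"]
y0["gender"] = y1["gender"]       # ((Y0 gender) = (Y1 gender)): pass gender up (TL)
y0["number"] = x0["number"]       # ((Y0 number) = (X0 number)): transfer equation
                                  # forces plural onto the Spanish NBar

print(y0)                         # {'gender': 'masc', 'number': 'pl'}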

We make use of a new framework (Probst et al., 2002) for rapid prototyping of MT systems for languages with limited amounts of electronically available linguistic resources and corpora, which includes a declarative formalism for symbolic transfer grammars. A grammar consists of a collection of transfer rules, which specify how phrase structures in a source language correspond and transfer to phrase structures in a target language, and the constraints under which these rules should apply. The framework also includes a fully-implemented transfer engine that applies the transfer grammar to a source-language input sentence at runtime and produces all possible word and phrase-level translations according to the grammar. This framework was designed to support research on methods for automatically acquiring transfer grammars from limited amounts of elicited word-aligned data. The framework also supports manual development of transfer grammars by experts familiar with the two languages. While the framework itself is still research work in progress, it is sufficiently well developed for serious experimentation.



2.2.2.3.2 The Runtime Transfer Engine

At run time, the transfer engine interprets individual transfer rules like that in Figure 4 to convert a morphologically analyzed source language sentence into a target language expression. The output of the run-time system is a lattice of translation alternatives that have yet to be morphologically inflected in the target language. The transfer engine produces a lattice of alternatives, as opposed to a single target language expression, because there may be syntactic and lexical ambiguity in the transfer rules and lexicon. Multiple translation choices for source language lexical items in the dictionary and multiple competing transfer rules can each increase the number of possible translations (see the next section).

The run-time translation system incorporates into a single module all three of the traditional steps of syntactic transfer-based MT: 1) syntactic parsing of the source language input; 2) transfer of the parsed constituents of the source language to their corresponding structured constituents on the target language side; and 3) generation of the target language expressions from the target language constituents. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system.

In the first stage, source language (SL) parsing is performed based solely on the SL side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in Allen (1995). A chart is populated with all the constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A target language (TL) chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules provide individual lexical choices for the TL word-level entries in the TL chart. The set of generated TL output expressions that corresponds to the collection of all TL chart entries is then collected into a TL lattice. Finally, the TL lattice is passed to a statistical decoder, which ranks the possible paths through the lattice of translation alternatives. A more detailed description of the runtime transfer-based translation sub-system can be found in Peterson (2002).
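The toy Python fragment below renders this integrated second stage: every SL chart entry yields TL chart entries, from lexical transfer rules at the word level and from structural rules above that. The chart representation, rule encoding, and lexical entries are all invented for illustration, and entries are assumed to arrive bottom-up (words before the constituents that contain them).

LEXICAL_RULES = {"pu": [], "ruka": ["casa"]}   # SL word -> TL choices (toy)

def nbar_rule(child_translations):
    """Toy structural rule for [PART N] -> [N]: keep only the noun."""
    return child_translations[-1]

def transfer(sl_chart):
    """Build a TL chart entry for each SL parse chart entry."""
    tl_chart = {}
    for span, entry in sl_chart.items():
        if entry["cat"] == "word":             # word level: lexical transfer
            tl_chart[span] = LEXICAL_RULES.get(entry["form"], [])
        elif entry["cat"] == "NBar":           # constituent level: NBar rule
            kids = [tl for child in entry["children"]
                    for tl in tl_chart.get(child, [])]
            tl_chart[span] = [nbar_rule(kids)] if kids else []
    return tl_chart                            # collected into the TL lattice

sl_chart = {(0, 1): {"cat": "word", "form": "pu"},
            (1, 2): {"cat": "word", "form": "ruka"},
            (0, 2): {"cat": "NBar", "children": [(0, 1), (1, 2)]}}
print(transfer(sl_chart)[(0, 2)])              # ['casa'] (uninflected; the
                                               # generator later adds plural)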

The methodology behind the system thus combines two separate modules: the transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to a target language model. This general framework has been under development by the MT group at Carnegie Mellon under the AVENUE project (Probst et al., 2002) and was previously used for rapid prototyping of an MT system for Hindi-to-English translation (Lavie et al., 2003).


2.2.2.3.3 Decoding

In the final stage, a decoder is used to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input, and these translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the decoder is to select a linear sequence of adjoining, but non-overlapping, translation units that maximizes the probability of the target language string given the source language string. The decoder is designed to use a probability model that calculates this probability as a product of two factors: a translation model for the translation units and a language model for the target language. The translation model is normally based on a probabilistic version of the translation lexicon, in which probabilities are associated with the different possible translations for each source-language word. Such a probability model can be acquired automatically by using a reasonable-size parallel corpus to train a word-to-word probability model based on automatic word alignment algorithms. This was not possible for Hebrew-to-English, since we do not have a suitable parallel corpus at our disposal. The decoder for the Hebrew-to-English system therefore uses the English language model as the sole source of information for selecting among alternative


translation segments. We use an English trigram language model trained on 160 million words.

The decoding search algorithm considers all possible sequences in the lattice and calculates the language model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
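A toy Python decoder in this spirit: each input position offers alternative translations, and a stand-in scoring function plays the role of the trigram language model. A real decoder searches a lattice of overlapping, variable-length translation units and can locally reorder them; both refinements are omitted here.

import itertools

def lm_score(words):
    """Stand-in for a trigram language model log-probability."""
    common = {"i", "saw", "the", "house"}
    return sum(0.0 if w.lower() in common else -1.0 for w in words)

def decode(lattice):
    """Pick the alternative sequence with the highest LM score."""
    return max(itertools.product(*lattice), key=lm_score)

lattice = [["I"], ["saw", "see"], ["the", "a"], ["house", "home"]]
print(" ".join(decode(lattice)))   # "I saw the house"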

2.2.2.4 Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the Mapudungun-Spanish manually written rule-based MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, neither did we neglect the available computational resources for Mapudungun. We developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus. Similarly, a version of the elicitation corpus translated into Mapudungun served as a development suite to guide us on which syntactic transfer rules were most vital to write.

The next three sub-sections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages; manually writing transfer rules forced us to consciously overcome them. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. And the third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor of the kind of transfer rules that are needed to write a transfer grammar from Mapudungun to Spanish, two completely unrelated languages. For a more in-depth discussion of the transfer grammar and translation mismatches between Mapudungun and Spanish, please see XX.

2.2.2.4.1 Suffix Agglutination

As discussed in Section 2.1, Mapudungun is an agglutinative language, often with four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled {VSuffG,1} in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second VSuffG rule, labeled {VSuffG,2} in Figure 6, recursively combines an additional VSuff with a VSuffG, passing up the feature structures of both children to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language, Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to, and generation of, Spanish expressions.

These two VSuffG rules apply successfully to combine the two verbal suffixes -fi and -ñ of example (1) into a single VSuffG. First, the rule {VSuffG,1} is applied to -fi to produce a VSuffG marked with third person object agreement. Next, the rule {VSuffG,2} recursively combines -ñ with the VSuffG containing -fi into a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verbal Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1) pe -fi -ñ
    see -3.obj -1.sg.subj.indic
    'I saw him/her/them/it'

{VSuffG,1}
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

{VSuffG,2}
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
 (X0 = X2))

Figure 6: Two recursive Verbal Suffix Group rules

2.2.2.4.2 Tense

Tense in Mapudungun is often morphologically unmarked. The temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun; consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.
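The decision logic of Figure 7 is compact enough to sketch directly. The function below is our own illustration with invented feature names; in the actual system these decisions are encoded as constraints in VBar transfer rules such as the one in Figure 8.

def temporal_interpretation(tense, aspect, lexical_aspect):
    # Map Mapudungun verbal features to a tense value for Spanish.
    if tense == "future":
        return "future"        # the future suffix wins outright: pe-a-n
    if lexical_aspect == "stative":
        return "present"       # stative verbs: niye-n 'I own'
    if aspect == "habitual":
        return "present"       # habitual -ke-: kellu-ke-n 'I help'
    return "past"              # unmarked eventive verbs: kellu-n 'I helped'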

2.2.2.4.3 Typological Divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are grammaticalized using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation. But the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see example (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as well as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle, as in example (3).

(2) Mapudungun: kofke-tu -n
                bread-verbalizer -1.sg.subj.indic
    Spanish:    com-o pan
                eat -1.sg.pres.indic bread
    English:    'I eat bread'

(3) Mapudungun: pe -nge -n
                see -passive -1.sg.subj.indic
    Spanish:    fui visto
                Aux.1.sg.past.indic see.passPart
    English:    'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side.

Figure 7: Abstract rules for the interpretation of tense in Mapudungun

Tense    | Aspect   | Lexical Aspect | Temporal Interpretation | Example
Future   | Any      | Any            | Future                  | pe -a -n (see -future -1.sg.subj.indic) 'I will see'
Unmarked | Any      | Stative        | Present                 | niye -n (own -1.sg.subj.indic) 'I own'
Unmarked | Habitual | Any            | Present                 | kellu -ke -n (help -habitual -1.sg.subj.indic) 'I help'
Unmarked | Unmarked | Unmarked       | Past                    | kellu -n (help -1.sg.subj.indic) 'I helped'

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
((X1::Y1)                          ; alignment
 ((X2 tense) = UNDEFINED)          ; SL constraint on morphological tense
 ((X1 lexicalaspect) = UNDEFINED)  ; SL constraint on the verb's aspectual class
 ((X2 aspect) = (NOT habitual))    ; SL constraint on grammatical aspect
 ((X0 tense) = past)               ; SL tense feature assignment
 ...)

Figure 8: Past tense rule (transfer of features omitted)


The rule for the passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary the correct grammatical features, and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

2.2.2.5 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is neither of the two indigenous languages, Mapudungun and Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona under a research license (http://www.lsi.upc.es/~nlp/freeling/parole-es.html). Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed together with a specification of the grammatical feature-values that the inflected form marks. We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute and value pairs of the format that our MT system expects. This way, when the run-time transfer engine needs a particular inflected form of a Spanish word, it can pass the uninflected form to the Spanish Morphology Generator and have it generate the appropriate surface inflected form.
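Conceptually, the generation module behaves like a lookup from a stem plus grammatical features to a surface form. The sketch below illustrates this with two hand-listed entries; in the real module the table is derived from the full UPC dictionary and our feature mapping, and the entries and names here are only assumptions.

SPANISH_FORMS = {
    # (citation form, frozen feature set) -> inflected surface form
    ("comer", frozenset({("person", 1), ("number", "sg"),
                         ("tense", "pres"), ("mood", "indic")})): "como",
    ("ser", frozenset({("person", 1), ("number", "sg"),
                       ("tense", "past"), ("mood", "indic")})): "fui",
}

def generate(stem, features):
    # Return the inflected form for a stem and a feature dict,
    # falling back to the bare stem when no entry is found.
    return SPANISH_FORMS.get((stem, frozenset(features.items())), stem)

print(generate("comer", {"person": 1, "number": "sg",
                         "tense": "pres", "mood": "indic"}))  # -> como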

When the Spanish morphological generation is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN) ) ) ) ) )>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA) ) ) ) ) ) )>
tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO) ) ) ) ) ) )>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA) ) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN) ) ) ) ) )>

2.2.3 Evaluation

Things that need to go in here:
- a hand evaluation
- BLEU, NIST, and Meteor scores
- whether we can evaluate the EBMT system on the same test data
- statistical significance tests of the results
- an automatic evaluation of a word-to-word translator

2.3 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals for building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]   ; insertion of aux on the Spanish side
((X1::Y2)                          ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
 ((X2 voice) =c passive)           ; Mapudungun voice constraint
 ((Y1 person) = (Y0 person))       ; passing person features to aux in Spanish
 ((Y1 number) = (Y0 number))       ; passing number features to aux in Spanish
 ((Y1 mood) = (Y0 mood))           ; passing mood features to aux in Spanish
 ((Y2 number) =c (Y1 number))      ; Spanish-side agreement constraint
 ((Y1 tense) = past)               ; assigning tense feature to aux in Spanish
 ((Y1 form) =c ser)                ; Spanish auxiliary selection
 ((Y2 mood) = part)                ; Spanish-side verb form constraint
 ...)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, Mapudungun, a non-Indo-European language, marks passive voice with a verbal suffix. Hence an auxiliary verb must be inserted on the Spanish side: the auxiliary verb is the first V constituent on the production side of the transfer rule. The rule then ensures that the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.



For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker and were later manually revised. Taking the Elicitation Corpus (Section N) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough for development of MT. The EC resembles a fieldwork questionnaire, containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features such as number, person, tense, and gender. The version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books with parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Like the Mapudungun-Spanish system, the Quechua-Spanish system contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section 2.2.2.5).

2.3.2.1 Morphology and Translation Lexicons

In order to build a translation and morphology lexicon, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books; only 3,002 word types occurred in more than one book.¹ Since 16,722 word types were seen only once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

¹ This was done before the OCR correction was completed, and thus this list contained OCR errors.
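The frequency cut described above can be reproduced in a few lines; the file names below are placeholders for the OCR-corrected book texts, and the whitespace tokenization is deliberately naive.

from collections import Counter

counts = Counter()
for path in ["cuentos_cusquenos.txt", "cuentos_de_urubamba.txt",
             "gregorio_condori_mamani.txt"]:
    with open(path, encoding="utf-8") as f:
        counts.update(f.read().split())

# Keep the 10,000 most frequent types; singletons, and with them many
# OCR errors and misspellings, fall below the cut.
to_annotate = [word for word, _ in counts.most_common(10000)]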

These 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10.

1. Word Segmentation
2. Root Translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the Final Root

Figure 10: Fields required to automatically build the translation and morphology lexicons

The native speaker was asked to provide the Translation of the Final Root (the last field in Figure 10) only if there had been a part-of-speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua, this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old", and the word is a verb whose root really means "to get old" ("machuyay").² Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it is actually translated differently in Spanish. For these cases, the native speaker was asked to use '||' instead of '|', and the post-processing scripts were designed to check for the consistency of '||' in both the translation and the POS fields.

² -ya- is a verbalizer in Quechua.


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word:              chayqa
Segmentation:      chay+qa
Root translation:  ese | esa | eso
Root POS:          Pron | Adj
Word translation:  ese || es ese
Word POS:          Pron || VP

Figure 11: Example of a word segmented and translated
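The expansion of such a row into lexical entries can be sketched as follows. The function, the entry syntax it prints, and the handling of the '|' and '||' conventions are our simplification of the project's post-processing scripts, not the scripts themselves.

def entries(root, trans_field, pos_field):
    # '||' separates translation/POS groups that must pair up one-to-one;
    # the scripts check that '||' appears in both fields or in neither.
    if ("||" in trans_field) != ("||" in pos_field):
        raise ValueError(f"inconsistent '||' use for {root}")
    out = []
    for pos_group, trans_group in zip(pos_field.split("||"),
                                      trans_field.split("||")):
        # within a group, '|' lists alternatives; cross POS and translation
        for pos in (p.strip() for p in pos_group.split("|")):
            for trans in (t.strip() for t in trans_group.split("|")):
                out.append(f"{pos} | [{root}] -> [{trans}] ((X1::Y1))")
    return out

# the root fields of Figure 11 yield the six entries mentioned above
for entry in entries("chay", "ese | esa | eso", "Pron | Adj"):
    print(entry)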

V | [ni] -> [decir]
((X1::Y1))

N | [pacha] -> [tiempo]
((X1::Y1))

N | [pacha] -> [tierra]
((X1::Y1))

Pron | [noqa] -> [yo]
((X1::Y1))

Interj | [alli] -> [a pesar]
((X1::Y1))

Adj | [hatun] -> [grande]
((X1::Y1))

Adj | [hatun] -> [alto]
((X1::Y1))

Adv | [kunan] -> [ahora]
((X1::Y1))

Adv | [allin] -> [bien]
((X1::Y1))

Adv | [ama] -> [no]
((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list

; "dicen que" on the Spanish side
Suff::Suff | [s] -> []
((X1::Y1)
 ((x0 type) = reportative))

; when following a consonant
Suff::Suff | [si] -> []
((X1::Y1)
 ((x0 type) = reportative))

Suff::Suff | [qa] -> []
((X1::Y1)
 ((x0 type) = emph))

Suff::Suff | [chu] -> []
((X1::Y1)
 ((x0 type) = interr))

VSuff::VSuff | [nki] -> []
((X1::Y1)
 ((x0 person) = 2)
 ((x0 number) = sg)
 ((x0 mood) = ind)
 ((x0 tense) = pres)
 ((x0 inflected) = +))

NSuff::NSuff | [kuna] -> []
((X1::Y1)
 ((x0 number) = pl))

NSuff::Prep | [manta] -> [de]
((X1::Y1)
 ((x0 form) = manta))

Figure 13: Manually written suffix lexical entries

For the current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuamán's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in Section 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in Section 2.2.2.5.

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and corrected nine machine translations when necessary. The correction of MT output is the first step towards automatic rule refinement and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.

Figure 14: Manually written grammar rules for Quechua-Spanish translation

{S,2}
S::S : [NP VP] -> [NP VP]
((X1::Y1)
 (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender)))

{SBar,1}
SBar::SBar : [S] -> ["Dice que" S]
((X1::Y2)
 ((x1 type) =c reportative))

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
((X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood)))

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO) ) ) ) )>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ) ) ) ) ) ) )>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO) ) ) ) (VP3 (VBAR2 (V35 SOY) ) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA) ) ) ) ) ) )>


Figure 15: MT output from the Quechua-Spanish Rule-Based MT prototype

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior; in particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in its early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual; hence the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of the ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,³ not unlike that for Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"), prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"), subordinating conjunctions such as $ ("that") and K$ ("when"), relativizers such as $ ("that"), and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or a prefixing particle.

³ To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua posed similar challenges of orthographic standardization; Hebrew differs from these two indigenous South American languages, however, in that its orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule based, relying on the AVENUE run-time transfer engine described earlier in this paper to produce translations. The two systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in the rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor-effort equivalent of about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English, and this small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing publicly available resources, which we adapted in novel ways for the MT task to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance for both rule-based systems using state-of-the-art measures, and the performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules; the same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
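The character-level mapping itself is simple. The table below shows only a handful of letters and is our reconstruction of the transliteration scheme used in this paper, not the system's actual pre-processing code; the real module also converts from the Windows encoding (cp1255) before mapping.

ROMANIZATION = {
    "א": "A", "ב": "B", "ג": "G", "ד": "D", "ה": "H",
    "ו": "W", "מ": "M", "ם": "M", "ר": "R", "ש": "$", "ת": "T",
}

def romanize(text):
    # Map each Hebrew character to its ASCII transliteration,
    # passing punctuation and other characters through unchanged.
    return "".join(ROMANIZATION.get(ch, ch) for ch in text)

print(romanize("בשורה"))  # -> B$WRH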

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features such as number, gender, person, and tense. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the analyzer is available to us only as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into its lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic H ("her") as a suffix. Such a form yields four different sequences of lexeme tokens, which are all stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice representation of a set of analyses for the Hebrew word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-structure representation of a set of analyses for the Hebrew word B$WRH
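In code, each analysis becomes a set of arcs over positions within the word, and the analyses of Figure 2 fall out as the arc sequences that tile the word from end to end. The representation below, including the span positions copied from Figure 3, is our own simplification of the system's internal format.

arcs = [
    {"span": (0, 4), "lex": "B$WRH", "pos": "N"},    # 'gospel'
    {"span": (0, 2), "lex": "B", "pos": "PREP"},     # B + hidden article
    {"span": (0, 1), "lex": "B", "pos": "PREP"},
    {"span": (1, 2), "lex": "H", "pos": "DET"},
    {"span": (1, 3), "lex": "$WR", "pos": "N"},      # 'bull'
    {"span": (3, 4), "lex": "$LH", "pos": "POSS"},   # possessive 'her'
    {"span": (2, 4), "lex": "$WRH", "pos": "N"},     # 'line'
    {"span": (0, 4), "lex": "B$WRH", "pos": None},   # unknown-word arc
]

def paths(arcs, start, end):
    # Enumerate every sequence of arcs that exactly tiles [start, end).
    if start == end:
        yield []
        return
    for arc in arcs:
        if arc["span"][0] == start and arc["span"][1] <= end:
            for rest in paths(arcs, arc["span"][1], end):
                yield [arc["lex"]] + rest

for p in paths(arcs, 0, 4):
    print(" ".join(p))
# B$WRH / B $WRH / B H $WRH / B $WR $LH / B$WRH (unknown-word arc)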

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed from the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation; these handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary would improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

English generation differs from Spanish generation in our systems. Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these forms should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form, and we add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms with English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
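A sketch of the enhancement step follows; the inflection rules shown handle only regular forms, and both the function names and the constraint encoding are illustrative assumptions rather than the system's actual representation.

def enhance_noun(hebrew, english):
    # One extra plural entry per noun, constrained to plural number.
    plural = english + ("es" if english.endswith(("s", "x", "ch")) else "s")
    return [(hebrew, english, {"num": "sg"}),
            (hebrew, plural, {"num": "pl"})]

def enhance_verb(hebrew, base):
    # Entries for infinitive, past, 3rd singular present, and future,
    # each with the constraint under which it may unify at run time.
    return [(hebrew, base, {"form": "inf"}),
            (hebrew, base + "ed", {"tense": "past"}),   # regular verbs only
            (hebrew, base + "s", {"tense": "pres", "person": 3, "num": "sg"}),
            (hebrew, "will " + base, {"tense": "fut"})]

print(enhance_noun("$WRH", "line"))
# [('$WRH', 'line', {'num': 'sg'}), ('$WRH', 'lines', {'num': 'pl'})]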

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
;; SL: $MLH ADWMH
;; TL: A RED DRESS
;; Score: 2
NP1::NP1 : [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
;; SL: H $MLWT H ADWMWT
;; TL: THE RED DRESSES
;; Score: 4
NP1::NP1 : [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Figure 4: NP transfer rules for nouns modified by adjectives from Hebrew to English
The transfer engine was designed to support both man-ually-developed structural transfer grammars and gram-mars that can be automatically acquired from bilingual data We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al 2003) and applied them to a word-aligned Hebrew-translated version of the elicitation corpus Some prelimi-nary performance results when using this automatically acquired grammar are reported in Section 4

As part of our two-month effort so far we developed a preliminary small manual transfer grammar which re-flects the most common local syntactic differences be-tween Hebrew and English The current grammar contains a total of 36 rules including 21 noun-phrase (NP) rules one prepositional-phrase (PP) rule 6 verb complexes and verb-phrase (VP) rules and 8 higher-phrase and sentence-level rules for common Hebrew constructions As we demonstrate in Section 4 this small set of transfer rules is already sufficient for producing reasonably legible trans-lations in many cases Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities Figure 4 contains an example of trans-fer rules for structurally transferring nouns modified by adjectives from Hebrew to English The rules enforce number and gender agreement between the noun and the adjective They also account for the different word order exhibited by the two languages and the special location of the definite article in Hebrew noun phrases

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5: Select translated sentences from the development data


System          | BLEU                    | NIST                    | Precision | Recall
No Grammar      | 0.0606 [0.0599, 0.0612] | 3.4176 [3.4080, 3.4272] | 0.3830    | 0.4153
Learned Grammar | 0.0775 [0.0768, 0.0782] | 3.5397 [3.5296, 3.5498] | 0.3938    | 0.4219
Manual Grammar  | 0.1013 [0.1004, 0.1021] | 3.7850 [3.7733, 3.7966] | 0.4085    | 0.4241

Table 1: System performance results with the various grammars (bracketed ranges are bootstrap confidence intervals)

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz; average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, the translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation, and we also report aggregate unigram precision and unigram recall as additional measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
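The bootstrap itself can be sketched in a few lines, assuming a metric function that maps a list of (hypothesis, references) segments to a corpus-level score such as BLEU; the parameter values are illustrative, not the ones used in our experiments.

import random

def bootstrap_ci(segments, metric, n_resamples=1000, alpha=0.05):
    # Resample the test segments with replacement and report the
    # empirical (alpha/2, 1 - alpha/2) interval of the metric.
    scores = []
    for _ in range(n_resamples):
        sample = [random.choice(segments) for _ in segments]
        scores.append(metric(sample))
    scores.sort()
    low = scores[int(n_resamples * alpha / 2)]
    high = scores[int(n_resamples * (1 - alpha / 2)) - 1]
    return low, high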


2.11 Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language for which to build MT systems in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than that with the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

2.11.1 Introduction

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years and have significantly improved the state of the art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation: by the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for the development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]; our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this section. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than with the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

The remainder of this section presents an overview of the Xfer system and its components: the elicitation of word-aligned parallel data (Section 2.11.2.1), automatic transfer rule learning (Section 2.11.2.2), and the run-time transfer system (Section 2.11.2.3).

Figure 1: Architecture of the Xfer MT System and its Major Components

2.11.2 Trainable Transfer-Based MT: Overview

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.11.2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number, agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules: for example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2,000 sentences and is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.11.2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it infers syntactic transfer rules. The learning system also learns the composition of transfer rules: in the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation; in this regard, they differ from "traditional" transfer rules, which exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule; a small data-structure sketch follows the list. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows the SL and TL type information to be different.

- Part-of-speech/constituent information: For both the SL and the TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the 'right-hand sides' of context-free grammar rules for the source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP : [V V Aux] -> [Aux "being" V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Figure 3: A transfer rule for present tense verb sequences in the passive voice

-Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

-XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.
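Taken together, these components map naturally onto a simple record structure. The following Python sketch (the field names are ours, for illustration; this is not the actual Xfer implementation) encodes the passive-voice rule of Figure 3:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TransferRule:
    # x-side refers to the source language (SL), y-side to the target (TL)
    rule_type: str                       # type information, e.g. "VP::VP"
    x_side: List[str]                    # SL component sequence
    y_side: List[str]                    # TL component sequence
    alignments: List[Tuple[int, int]]    # SL-to-TL component alignments
    x_constraints: List[str] = field(default_factory=list)
    y_constraints: List[str] = field(default_factory=list)
    xy_constraints: List[str] = field(default_factory=list)

# The rule of Figure 3 (present tense, passive voice verb sequences)
passive_simple_present = TransferRule(
    rule_type="VP::VP",
    x_side=["V", "V", "Aux"],
    y_side=["Aux", "being", "V"],
    alignments=[(1, 3)],                 # X1::Y3
    x_constraints=["(x1 form) = part", "(x1 aspect) = perf",
                   "(x2 form) = part", "(x2 aspect) = imperf",
                   "(x2 lexwx) = jAnA", "(x3 lexwx) = honA",
                   "(x3 tense) = pres"],
    y_constraints=["(y1 lex) = be", "(y1 tense) = pres",
                   "(y3 form) = part"],
    xy_constraints=["(x0 tense) = (x3 tense)"],
)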

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary jAnA ("go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial 'guesses' at transfer rules. The rules that result from Seed Generation are 'flat' in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be re-written more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only

for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
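As a concrete illustration of the compositionality stage described above, the following sketch (a deliberate simplification, not the actual learner) replaces a previously learned constituent's part-of-speech sequence inside a flat rule with its phrasal label:

def compose(flat_rhs, constituent_rules):
    """Rewrite a flat right-hand side using learned constituent rules.
    constituent_rules maps POS sequences to phrase labels, e.g.
    {("Det", "Adj", "N"): "NP", ("Det", "N"): "NP"}."""
    rhs = list(flat_rhs)
    # try longer patterns first so "Det Adj N" wins over "Det N"
    for pattern, label in sorted(constituent_rules.items(),
                                 key=lambda kv: -len(kv[0])):
        i = 0
        while i + len(pattern) <= len(rhs):
            if tuple(rhs[i:i + len(pattern)]) == pattern:
                rhs[i:i + len(pattern)] = [label]
            i += 1
    return rhs

# e.g. compose(["Det", "Adj", "N", "P"], {("Det", "Adj", "N"): "NP"})
# yields ["NP", "P"], the generalized expansion of PP described above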

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up Chart Parser, as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for

ê = argmax_e p(e|f) = argmax_e p(f|e) p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = ∏_{i=1}^{I} p(e_i | e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(f̃|ẽ) = ∏_j ∑_i p(f_j | e_i)    (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then

p(f|e) = ∏_s p(f̃_s|ẽ_s) = ∏_s ∏_j ∑_i p(f_{s_j} | e_{s_i})    (4)

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e. leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e. longer movements are less likely than shorter ones) with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring, compared to the best scoring hypothesis up to that point, are pruned from further consideration.
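The following sketch puts Equations (1)-(4) together in Python (a minimal, monotone version for illustration: no reordering and no beam pruning; the lattice format and the lm_logprob function are our own assumptions, not the actual decoder):

import math

def phrase_tm_logprob(f_words, e_words, t):
    # Eq. (3): p(f~|e~) = prod_j sum_i p(f_j | e_i), computed in log space;
    # t maps (f_word, e_word) pairs to IBM-1 lexicon probabilities
    return sum(math.log(sum(t.get((fj, ei), 1e-9) for ei in e_words))
               for fj in f_words)

def best_sequence(lattice, n, lm_logprob):
    """Eqs. (1), (2) and (4): choose adjoining, non-overlapping units that
    cover source positions 0..n and maximize the translation-model score
    plus the trigram language-model score (all in log space).
    lattice: list of (start, end, target_words, tm_logprob) units."""
    best = {0: (0.0, ())}            # best[i] = best hypothesis covering [0, i)
    for i in range(n):
        if i not in best:
            continue                 # no hypothesis reaches position i
        score, words = best[i]
        for start, end, ewords, tm in lattice:
            if start != i:
                continue
            new = words + tuple(ewords)
            # trigram increment p(e_k | e_{k-2}, e_{k-1}) for appended words
            lm = sum(lm_logprob(new[max(0, k - 2):k + 1])
                     for k in range(len(words), len(new)))
            cand = (score + tm + lm, new)
            if end not in best or cand[0] > best[end][0]:
                best[end] = cand
    return best.get(n)               # None if the lattice has no full cover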

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus that we extracted from the Penn Treebank.

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way, we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g. names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction.

This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural, even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher         Xfer
past participle      (tam = yA)      (aspect = perf) (form = part)
present participle   (tam = wA)      (aspect = imperf) (form = part)
infinitive           (tam = nA)      (form = inf)
future               (tam = future)  (tense = fut)
subjunctive          (tam = subj)    (tense = subj)
root                 (tam = 0)       (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer.

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
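The feature mapping of Table I amounts to a small lookup. A sketch of the idea (the dictionary contents follow Table I; the function name and dict representation are our own):

# Morpher's tam value rewritten into the Xfer tense/aspect/mood features
TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},  # past participle
    "wA":     {"aspect": "imperf", "form": "part"},  # present participle
    "nA":     {"form": "inf"},                       # infinitive
    "future": {"tense": "fut"},                      # future
    "subj":   {"tense": "subj"},                     # subjunctive
    "0":      {"form": "root"},                      # root
}

def morpher_to_xfer(analysis):
    """Map a Morpher analysis (a feature dict) into Xfer grammar features,
    rewriting the tam feature according to Table I."""
    xfer = {k: v for k, v in analysis.items() if k != "tam"}
    xfer.update(TAM_TO_XFER.get(analysis.get("tam"), {}))
    return xfer

# e.g. morpher_to_xfer({"root": "bheja", "tam": "yA", "num": "pl"})
# -> {"root": "bheja", "num": "pl", "aspect": "perf", "form": "part"}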

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial, manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system.

During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules.

Hindi NP with left recursion (jIvana (life) ke (of) eka (one) aXyAya (chapter)):
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion (one chapter of life):
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English.

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules.

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e. how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
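The frequency-based pruning described above can be pictured as follows (a sketch; the probability threshold is illustrative):

def prune_rules(rule_counts, min_prob=0.01):
    """Assign each learned rule a probability proportional to the number
    of training examples that produced it, then drop low-probability rules."""
    total = sum(rule_counts.values())
    return {rule: count / total
            for rule, count in rule_counts.items()
            if count / total >= min_prob}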

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words, leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word.

The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we 'enhanced' the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for the plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs.
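A sketch of the kind of spelling-rule pluralization and entry enhancement described above (the rules shown are illustrative only, not the project's actual rule set):

def english_plural(noun):
    # simple spelling rules: add 'es' after sibilants, 'y' -> 'ies', else 's'
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def enhance_noun_entry(hindi_root, english_noun):
    """Return the original root-form entry plus an added plural entry
    whose constraint marks the number as plural."""
    return [
        (hindi_root, english_noun, {}),                             # root form
        (hindi_root, english_plural(english_noun), {"num": "pl"}),  # plural
    ]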

With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert, in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

-72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

-105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

-48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details, see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module, to obtain the root forms along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
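In outline, the three passes can be pictured like this (a much-simplified sketch; the argument names and the dict-based lexicon representation are our own assumptions):

def build_lattice(sentence, phrase_lex, word_lex, analyze, grammar_rules):
    """Three-pass lattice construction.
    phrase_lex: {hindi_phrase_tuple: english_phrase} (phrasal entries)
    word_lex:   {hindi_word_or_root: english_word}
    analyze:    function mapping an inflected word to its root form
    grammar_rules: function mapping a root sequence to lattice entries"""
    lattice = []
    n = len(sentence)
    # Pass 1: full inflected forms against phrasal entries, no grammar rules
    for i in range(n):
        for j in range(i + 2, n + 1):        # phrasal = more than one word
            phrase = tuple(sentence[i:j])
            if phrase in phrase_lex:
                lattice.append((i, j, phrase_lex[phrase]))
    # Pass 2: morphological root forms fed into the grammar rules
    roots = [analyze(w) for w in sentence]
    lattice.extend(grammar_rules(roots))
    # Pass 3: original inflected words, word-for-word, no grammar rules
    for i, w in enumerate(sentence):
        if w in word_lex:
            lattice.append((i, i + 1, word_lex[w]))
    return lattice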

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

-Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

-Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

-Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations.

System                  BLEU               NIST
EBMT                    0.058              4.22
SMT                     0.102 (± 0.016)    4.70 (± 0.20)
Xfer no grammar         0.109 (± 0.015)    5.29 (± 0.19)
Xfer learned grammar    0.112 (± 0.016)    5.32 (± 0.19)
Xfer manual grammar     0.135 (± 0.018)    5.59 (± 0.20)
MEMT (Xfer + SMT)       0.136 (± 0.018)    5.65 (± 0.21)

Table II. System Performance Results for the Various Translation Approaches.

Fig. 9. Results by NIST score: NIST score as a function of the decoder reordering window (0-4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.

Fig. 10. Results by BLEU score: BLEU score as a function of the decoder reordering window (0-4) for the same five systems.

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4.

We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string, according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e. only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).

(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.

(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
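This resampling procedure is simple to implement. A minimal sketch (here score_fn stands for a BLEU or NIST scorer over a drawn set of test sentences; the function name is ours):

import random
import statistics

def bootstrap_ci(test_set, score_fn, draws=10000, seed=0):
    """Median and 95% confidence interval obtained by repeatedly
    resampling the test set with replacement, as described above."""
    rng = random.Random(seed)
    n = len(test_set)
    scores = sorted(
        score_fn([test_set[rng.randrange(n)] for _ in range(n)])
        for _ in range(draws)
    )
    lower = scores[int(0.025 * draws)]
    upper = scores[int(0.975 * draws)]
    return statistics.median(scores), lower, upper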

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar.

Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated, rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India that conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language resource. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective.

By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses: from building all manner of NLP tools, to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool, to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development.

Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases.

This was accomplished with only a small, manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the Hindi, Hebrew, Mapudungun, and Quechua systems contrasted in this paper, the AVENUE team has also worked on Dutch, and is beginning work on Inupiaq, Aymara, Brazilian Portuguese, and other languages native to Brazil.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described.

The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

The research reported on the Hebrew-to-English system was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio (2001). Gramática Quechua Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.


and because the major language is the target translation language, Table 1 focuses on resources that are required for a low-resource source language.

All methods of MT require resources of both human and computational knowledge. The most basic human resource that all four of our approaches to MT need is the availability of computational linguists familiar with building MT systems. Building an MT system requires competency with computers and a working knowledge of the common techniques used in natural language processing. Frequently, however, a computational linguist working on MT for a particular language pair does not speak or have detailed knowledge of both, or either, language in the pair. Of course, a more intimate knowledge of the languages of translation is generally helpful in building an MT system. Of the four MT creation strategies used within the AVENUE project, a computational linguist with language-specific knowledge is particularly important to the hand-written RBMT approach: it is difficult for a person to write transfer rules for a language they do not speak or have not studied. Another human resource that can be tapped for building MT systems is the knowledge that linguistically naïve, non-expert bilingual speakers possess. As discussed later in this section, the automatic rule learning approach to MT development, which the AVENUE project has pioneered, harnesses the expertise of bilingual speakers.

Aside from resources of human knowledge and labor, MT systems require computational resources. Computational resources fall into at least two categories. First, there are computational language processing resources, such as morphological analysis of a source language. Both the EBMT and the SMT systems within the AVENUE project can use, and do benefit from, shallow morphological analysis of the source sentence, where complex surface word forms are segmented into individual morphemes. And a deep morphological analysis, which processes a source word into a stem and a set of morphosyntactic feature-value pairs, is required for our rule-based MT formalism. The syntactic transfer rules, whether written by hand or automatically learned, then reference the specific features and values of individual words and morphemes.

Data resources form a second category of computational resources needed for MT system development.

Natural language data comes in many different forms. The simplest data needed by all the AVENUE MT systems is a translation lexicon, which at a minimum lists word and/or stem level translations between the source and the target languages. Augmenting a simple lexicon with probability estimates for each translation of each entry benefits all four of our MT approaches. And just as with morphological analysis, if the lexicon contains feature-value pairs, the rule-based MT methods may reference them. Additionally, the RBMT methods require word-level part-of-speech information in the lexicon.

While a lexicon itself, possibly hand built, is a sufficient data resource from which to produce a hand-written RBMT system, each of the other three AVENUE MT methods relies more heavily on automatic algorithms to describe correct translations, and hence needs additional data resources. The corpus-based MT approaches, SMT and EBMT, both feed off sentence-aligned bilingual corpora, from which they glean a variety of translation statistics. Bilingual corpora can also be a source of word translation probability estimates that the rule-based MT systems can use. And finally, the AVENUE approach to learning syntactic transfer rules requires a unique data resource: a specially designed elicitation corpus. An AVENUE elicitation corpus consists of a set of phrases and sentences in the resource-rich language. A bilingual speaker uses a graphical tool to both translate the elicitation corpus and to then manually word-align the translations. The phrases and sentences in the elicitation corpus are chosen to form minimal and near-minimal pairs. The rule learning algorithm then compares and contrasts the translations and word alignments of portions of the elicitation corpus which it knows are similar, generalizing from specific translations and word alignments to high-level syntactic transfer rules used in translating unseen text (Probst, 2005).

1.3 Available Resources Lead to Specific MT Solutions

Section 2 of this paper discusses the MT systems the AVENUE team built for each of our four language pairs in turn, giving specific details on the available resources and the systems that were built. Here we present a brief side-by-side comparison of the resource constraints we worked under for each of the four resource-poor languages and

Human resources: computational linguist (CL) with knowledge of MT and of the language pair; bilingual speaker. Computational resources: processing (morphological analysis) and data (lexicon, parallel corpus, elicitation corpus).

| MT approach                 | CL: MT | CL: language pair | Bilingual speaker | Morphological analysis | Lexicon | Parallel corpus | Elicitation corpus |
|-----------------------------|--------|-------------------|-------------------|------------------------|---------|-----------------|--------------------|
| Corpus-based: Example-based | X      | If available      |                   | If available           | X       | X               |                    |
| Corpus-based: Statistical   | X      | If available      |                   | If available           | X       | X               |                    |
| Rule-based: Hand-written    | X      | X                 |                   | X                      | X       | If available    |                    |
| Rule-based: Learned         | X      | If available      | X                 | X                      | X       | If available    | X                  |

Table 1: Any approach to building machine translation (MT) systems requires resources. This table presents and contrasts the resources required by the four AVENUE MT creation strategies. Each 'X' represents a resource (column) that is required for a particular approach to MT (row); 'If available' marks a resource that is exploited when present but not required.


how these constraints led us to build particular MT systems. Table 2 summarizes the available resources for each low-resource language. At a glance, with the possible exception of Hindi, the AVENUE project had better access to human resources than to computational resources.

The first column in Table 2, which notes the availability of computational linguists with knowledge of machine translation techniques, is a prerequisite to developing MT systems for any language. The third column in Table 2 is also fully filled in, because the AVENUE project had access to bilingual speakers for each of our translation pairs.

On the other hand, for any particular resource-poor language there are few computational linguists with detailed knowledge of that language. The AVENUE project elected to build a Hebrew MT system specifically because of project-internal expertise on Hebrew. Similarly, we chose to work with the South American indigenous languages Mapudungun and Quechua only after it became clear we would have some language-specific expertise on each. The AVENUE project specifically cultivated relationships with government, civic, linguistic, and educational organizations of the native-speaking populations of several Native American languages, and it is through these contacts that the AVENUE project gained computational linguistic expertise in Mapudungun and Quechua. After the AVENUE project had already begun work on Mapudungun, a computational linguist with fieldwork experience in Mapudungun joined our team. Similarly, another member of the AVENUE project visited Cuzco, Peru, for three months to learn Quechua. Unlike Hebrew, Mapudungun, and Quechua, the AVENUE team did not elect to work on Hindi: the task of building an MT system for Hindi was presented to us as the surprise language portion of the DARPA TIDES project. Hence no computational linguist on our team speaks Hindi. Fortunately, we had access to fairly significant computational resources for Hindi.

As noted, computational resources for low-resource languages are more difficult to come by than for resource-rich languages. Only Hindi had either an easily available, ready-made morphological analyzer or a significant amount of parallel sentence-aligned data. For Hebrew we were able to adapt a freely available morphological analyzer, while for Mapudungun and Quechua we built morphological analysis tools from scratch. To our knowledge, there was no previously existing Mapudungun morphological analyzer; hence we were forced to build our own. And with the effort we had put into building a morphological analyzer for Mapudungun, together with our acquired human knowledge resources, we were able to quickly create a basic morphological analyzer for Quechua.

None of the four low-resource languages had lexicons adequate for machine translation; the language-specific sections below describe in detail the creation and adaptation of lexical resources for each individual language. For two of the languages, Hindi and Mapudungun, the AVENUE project had access to reasonably sized sentence-aligned parallel corpora. Through participation in the DARPA surprise language exercise on Hindi, the AVENUE project gained access to a large sentence-aligned Hindi-English parallel corpus. In contrast, a significant effort within the AVENUE project, as detailed in Section 2.2.1, was devoted to creating, transcribing, and translating a bilingual Mapudungun-Spanish speech corpus. And rounding out the computational resources, the elicitation corpus is a unique data resource in that it is created in the resource-rich language and then translated by bilingual speakers into the resource-poor language. Because of the emphasis within the AVENUE project on collaboration with native speakers, the elicitation corpus was available as a data resource in each of the four languages.

| Language   | CL: MT | CL: language pair            | Bilingual speaker | Morphological analysis | Lexicon          | Parallel corpus | Elicitation corpus |
|------------|--------|------------------------------|-------------------|------------------------|------------------|-----------------|--------------------|
| Mapudungun | X      | X                            | X                 | Created                | Created          | Created         | X                  |
| Quechua    | X      | Learned                      | X                 | Created                | Created          |                 | X                  |
| Hebrew     | X      | X                            | X                 | Adapted                | Adapted          |                 | X                  |
| Hindi      | X      | External consultant, Learned | X                 | X                      | Created, Adapted | X               | X                  |

Table 2: Different languages with different available resources. This table presents and contrasts the relevant available resources for the four low-resource languages discussed in this paper. 'X' marks a resource that was already available; 'Created', 'Adapted', and 'Learned' mark resources the project built, adapted, or acquired.


Table 3 lists the MT systems the AVENUE project developed for each of the four low-resource languages. Within the AVENUE project, the available resources dictate which MT approaches are feasible for a given language pair. Corpus-based approaches to MT were infeasible for the Hebrew-to-English and the Quechua-to-Spanish language pairs, because we lacked sentence-aligned bilingual corpora between these languages. Similarly, because the AVENUE team lacked first-hand knowledge of Hindi, only a small hand-written rule-based MT system was developed for Hindi-to-English MT. In addition to resource availability constraints, time constraints have also affected which MT systems we built for each language. Of the two corpus-based approaches to MT, we have only applied example-based MT to the task of Mapudungun-to-Spanish translation. Similarly, we have not yet employed the rule-learning approach to build MT systems for Mapudungun-to-Spanish or for Quechua-to-Spanish. Applying the AVENUE-developed automatic syntactic rule learning method of MT to Mapudungun and Quechua is high on our list of next steps.

2. Four MT Systems

Over the past six years, the AVENUE project at the Language Technologies Institute at Carnegie Mellon University has worked to produce basic MT systems for a number of resource-poor languages. Chief among our accomplishments is the construction of MT systems for four languages for which few or no MT systems have previously been built: Mapudungun, Quechua, Hebrew, and Hindi. Mapudungun is the smallest of these four languages: fewer than 1 million people speak Mapudungun in southern Chile and Argentina. Quechua, a second indigenous language of South America, is more widely spoken; there are approximately 10 million speakers of Quechua in Peru, Bolivia, Ecuador, southern Colombia, and northern Argentina. Hebrew is spoken by approximately 7 million people in Israel, and Hindi is the largest of the four languages for which we built MT systems: there are several hundred million native and secondary speakers of Hindi in India and abroad. Despite the large number of Hindi speakers, Hindi is still resource-poor, with only rudimentary natural language processing data and tools freely obtainable.

Building MT systems for resource-poor languages is challenging. Section 1.3 contrasted the availability (and unavailability) of resources for each of our four resource-poor languages. In the subsections that follow we consider each language in turn, specifically illustrating how an omnivorous approach to MT allowed the AVENUE team to overcome the specific resource gaps of each language to build MT systems. Finally, in the concluding section, we synthesize the lessons and experience we gained in building these machine translation systems and present overall suggestions for others looking to build MT for languages with scarce computational resources.

2.1 Two Indigenous South American Languages

The omnivorous approach to MT within the AVENUE project was originally developed to tackle challenges we faced in developing MT for indigenous South American languages under an NSF grant. Omnivorous MT required close collaboration with native-speaking populations: the AVENUE project relies on native bilingual speakers to provide crucial information toward building MT systems. After considering several candidate languages, the AVENUE team elected to work with Mapudungun and Quechua, as they promised the closest collaboration with native language organizations. The details of the collaborations with the native organizations are presented in Sections 2.2 and 2.3.

Even with a strong working relationship with native speakers of Mapudungun and Quechua, the AVENUE team faced serious hurdles to creating functional MT for these languages. At the time the AVENUE team started working on Mapudungun, even simple natural language tools, such as morphological analyzers or spelling correctors, did not exist. In fact, there were few electronic resources from which such natural language tools might be built. There were no standard Mapudungun text or speech corpora or lexicons, and no parsed treebanks. The text that did exist was in a variety of competing orthographic formats. More resources exist for Quechua, but they are still far from what is needed to build a complete MT system.

Mapudungun and Quechua share challenges for MT creation beyond simple resource scarcity, however. These additional challenges stem from both Mapudungun and

| Language   | Example-based | Statistical | Hand-written rule-based | Learned rule-based |
|------------|---------------|-------------|-------------------------|--------------------|
| Mapudungun | X             |             | X                       |                    |
| Quechua    |               |             | X                       |                    |
| Hebrew     |               |             | X                       | X                  |
| Hindi      | X             | X           | X (small)               | X                  |

Table 3: Languages vs. MT systems. Depending on the available resources, the AVENUE project built a variety of MT systems for each language.


Quechua being indigenous South American languages. First, both Mapudungun and Quechua pose unique challenges from a linguistic theory perspective. Although Mapudungun and Quechua are unrelated languages, they both have complex agglutinative morphological structures. In addition, Mapudungun is polysynthetic, incorporating objects into the verb of a sentence. Agglutination and polysynthesis are both linguistic properties that the majority of languages for which natural language resources have been built do not possess. Second, human factors also pose a particular challenge for these two languages. Specifically, there is a scarcity of people trained in computational linguistics who are native speakers of, or who have a good knowledge of, these indigenous languages. And finally, an often overlooked challenge that confronts development of NLP tools for resource-scarce languages is the divergence of the culture of the native speaker population from the western culture of computational linguistics.

Despite the challenges facing the development of natural language processing systems for Mapudungun and Quechua, the AVENUE team has developed a suite of basic language resources for each of these languages and has then leveraged these resources into more sophisticated natural language processing tools. The AVENUE project led a collaborative group of Mapudungun speakers in first collecting, and then transcribing and translating into Spanish, by far the largest spoken corpus of Mapudungun available. From this corpus we then built a spelling checker and a morphological analyzer for Mapudungun. For Quechua we have created parallel and aligned data, as well as a basic bilingual lexicon with morphological information. And with these resources we have built two prototype machine translation (MT) systems for Mapudungun and one prototype MT system for Quechua. The following sections detail the construction of these resources, focusing on overcoming the specific challenges Mapudungun and Quechua each present as resource-scarce languages.

2.2 MAPUDUNGUN

Since May of 2000, in an effort to ultimately produce Mapudungun-Spanish machine translation, computational linguists at CMU's Language Technologies Institute have collaborated with Mapudungun language experts at the Instituto de Estudios Indígenas (IEI, Institute for Indigenous Studies) at the Universidad de la Frontera (UFRO) in Chile, and with the Bilingual and Multicultural Education Program of the Ministry of Education (Mineduc) in Chile (Levin et al., 2000).

From the very outset of our collaboration, we battled with the scarcity of electronic resources for Mapudungun. The first phase of the AVENUE collaboration was to collect and produce parallel Mapudungun-Spanish language data from which higher-level language processing tools and systems could be built. One barrier we faced in the collection of Mapudungun language data is that there are currently several competing orthographic conventions for written Mapudungun. Early in the AVENUE Mapudungun collaboration, the IEI-UFRO team established a set of orthographic conventions, and all the data collected under the AVENUE project conforms to it; if Mapudungun data was originally in some other orthographic convention, we manually converted it into our own orthography. Recently, however, a different orthography, Azümchefi, has been chosen by the Chilean government for official documents. Portions of our data have been automatically converted into Azümchefi using automatic substitution rules.

2.2.1 Corpora and Lexicons

Recognizing the appetite of MT systems for data resources, and the scarcity of Mapudungun language data, the AVENUE team began collecting Mapudungun data soon after our collaboration began with IEI and UFRO in 2000. Directed from CMU and conducted by native speakers of Mapudungun at UFRO in Temuco, Chile, our data collection efforts ultimately resulted in three separate corpora: 1) a small parallel Mapudungun-Spanish corpus of historical texts and newspaper text; 2) 1,700 Spanish sentences that were manually translated into Mapudungun and word-aligned, carefully crafted to elicit specific syntactic phenomena in Mapudungun; and 3) a parallel corpus consisting of 170 hours of transcribed and translated Mapudungun speech. Natural language translation is challenging even for humans, particularly when the source text to translate is spoken language, with its many disfluencies and colloquial expressions. The translation of the speech corpus was completed by bilingual speakers of Mapudungun and Spanish who were not professional translators and who attempted to translate as literally as possible. Furthermore, each section of the speech corpus was only ever viewed by a single translator. All of these translation challenges combine to render the Spanish translation of the Mapudungun text imperfect. Still, a bilingual corpus is a valuable resource. Additional details on the corpora are given in Monson et al. (2004).

While small, the second corpus listed above, the structural elicitation corpus, has aided our development of the hand-built rule-based MT system for Mapudungun (see Section 2.2.2.2). The larger spoken corpus was put to two specific uses in developing MT for Mapudungun. First, the transcribed and translated spoken corpus was used as a parallel corpus to develop an example-based MT system, described further in Section 2.2.2.1. Second, the AVENUE team used the spoken corpus to derive a bilingual stem lexicon for use in rule-based MT systems.

To obtain the bilingual stem lexicon, the AVENUE team first converted the transcribed spoken-corpus text into a monolingual Mapudungun word list. The 117,003 most frequent fully inflected Mapudungun word forms were hand-checked for spelling according to the adopted orthographic conventions for Mapudungun, and 15,120 of the most frequent spelling-corrected, fully inflected word forms were hand-segmented into two parts: the first part consisting of the stem of the word, and the second part consisting of one or more suffixes. This produced 5,234 stems and 1,303 suffix groups. These 5,234 stems were then translated into Spanish.

2.2.1.1 Mapudungun Morphological Analyzer

In addition to consuming data resources, the MT systems used in the AVENUE project consume the computational resource of morphological analysis. Mapudungun is an agglutinative and polysynthetic language. Figure 2 contains glosses of a few morphologically complex Mapudungun verbs that occur in our corpus of spoken Mapudungun, and Monson et al. (2004) document the large number of occurring Mapudungun word forms in this speech corpus. A typical complex verb form in our spoken corpus


consists of five or six morphemes. Agglutinative morphological processes explode the number of possible surface word forms, while simultaneously causing individual surface word forms to be rare. Furthermore, each morpheme may alone contain several morphosyntactic features. Hence it is difficult for an MT system to translate Mapudungun words directly into another language. Analyzing the morphology of Mapudungun allows our MT systems to cover a larger fraction of all Mapudungun surface word forms. And by identifying each morpheme in each Mapudungun word and assigning a meaning to each identified morpheme, a rule-based MT system can translate each piece of meaning individually.

We had found few data resources available for Mapudungun, so we were not surprised to learn that no morphological analyzer for Mapudungun had ever been written; necessarily, then, we developed our own. Using the stem lexicon we had created and reference grammars of Mapudungun, our morphological analyzer takes a Mapudungun word as input and produces as output all possible morphological analyses of the word. Each analysis specifies:

1) a single stem in that word,
2) each suffix in that word, and
3) a syntactic analysis for the stem and each identified suffix.

To identify the morphemes (stem and suffixes) in a Mapudungun word, a lexicon of stems works together with a fairly complete lexicon of Mapudungun suffixes. The current version of the stem lexicon contains XXX stems and their Spanish translations. Each entry in this lexicon lists the part of speech of the stem, as well as other features associated with the stem, such as lexical aspect in the case of verb stems. The suffix lexicon, built by hand by computational linguists on the AVENUE team, contains all the uncontroversial inflectional morphemes of Mapudungun; there are XXX Mapudungun suffixes in the suffix lexicon. Each suffix lists the part of speech that the suffix attaches to, as well as the linguistic features, such as person, number, and mood, that it marks. The morphological analyzer performs a recursive and exhaustive search over all possible segmentations of a given Mapudungun word. The analyzer starts from the beginning of the word and identifies and removes each stem that is an initial string in that word. (The current morphological analyzer does not handle noun incorporation, a fairly rare phenomenon in Mapudungun.) The analyzer then examines the remaining string, looking for a valid combination of suffixes that could complete the word. A suffix combination is valid if:

1) each of the suffixes in the combination can attach to stems with the part of speech of the current stem, and
2) the suffixes occur in a grammatical order.

Mapudungun morphology is relatively unambiguous. Most Mapudungun words for which the stem is in the stem lexicon receive a single analysis; a few truly ambiguous suffix combinations may cause a Mapudungun word to receive more than one distinct analysis.

Once the morphological analyzer has found all possible segmentations of a word, it combines the feature information from the stem and the suffixes encountered in the analyzed word to create the syntactic analysis that is returned. For an example, see Figure 3.
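The search itself is compact enough to sketch. The following is a minimal illustration of the exhaustive stem-plus-suffix segmentation described above; the toy lexicons, feature names, and glosses are invented stand-ins for the actual AVENUE resources, and the grammatical-order check on suffixes is reduced to a comment.

# Minimal sketch of the exhaustive segmentation search described above.
# The toy stem and suffix lexicons are illustrative stand-ins, not the
# actual AVENUE resources.

STEMS = {
    "pe": {"pos": "V", "gloss": "ver"},        # 'see'
    "kellu": {"pos": "V", "gloss": "ayudar"},  # 'help'
}
SUFFIXES = {
    "fi": {"attaches_to": "V", "features": {"obj": "3"}},
    "n":  {"attaches_to": "V", "features": {"subj": "1sg", "mood": "indic"}},
}

def suffix_sequences(rest, pos):
    """Recursively enumerate ways to parse `rest` as suffixes valid for `pos`."""
    if rest == "":
        yield []
    for i in range(1, len(rest) + 1):
        sfx = rest[:i]
        entry = SUFFIXES.get(sfx)
        # A suffix is only considered if it can attach to this part of speech;
        # a full implementation would also check grammatical suffix order.
        if entry and entry["attaches_to"] == pos:
            for tail in suffix_sequences(rest[i:], pos):
                yield [sfx] + tail

def analyze(word):
    """Return all (stem, suffixes, features) analyses of a word."""
    analyses = []
    for i in range(1, len(word) + 1):
        stem = word[:i]
        if stem in STEMS:
            pos = STEMS[stem]["pos"]
            for sfxs in suffix_sequences(word[i:], pos):
                features = {"pos": pos}
                for s in sfxs:                 # merge features from each suffix
                    features.update(SUFFIXES[s]["features"])
                analyses.append((stem, sfxs, features))
    return analyses

print(analyze("pefin"))  # [('pe', ['fi', 'n'], {'pos': 'V', 'obj': '3', ...})]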

2.2.2 Two Machine Translation Systems

With the basic data and computational resources built for Mapudungun, the AVENUE team could proceed toward constructing machine translation systems from Mapudungun into Spanish. We have developed two prototype Mapudungun-Spanish MT systems. The first follows a data-driven paradigm, example-based MT: an EBMT system can exploit the decent-sized spoken corpus of Mapudungun as a source of data to produce translations with higher lexical coverage but lower translation accuracy. The second is a hand-built rule-based MT system. Building an MT system by hand is feasible only because the AVENUE team has computational linguists available as a resource. The human-developed rule-based MT system complements the EBMT system with high-quality translations of specific syntactic constructions over a smaller vocabulary.

2.2.2.1 An Example-Based Machine Translation System

Example-Based Machine Translation (EBMT) relies on previous translations performed by humans to create new translations without the need for human translators. The previous translations are called the training corpus. For the best translation quality, the training corpus should be as large as possible and as similar to the text to be translated as possible. When the exact sentence to be translated occurs in the training material, the translation quality is human-level, because the previous translation is re-used. As the source sentence differs more and more from the training material, quality decreases, because smaller and smaller fragments must be combined to produce the translation, increasing the chances of an incorrect translation.

As the amount of training material decreases, so does the translation quality, because there are fewer long matches between the training texts and the input to be translated. Conversely, more training data can be added at any time, improving the system's performance by allowing more and longer matches. EBMT usually finds only partial matches, which generate lower-quality translations. Further, EBMT searches for phrases of two or more words, and thus there can be portions of the input which do not produce phrasal translations. For unmatched portions of the input, EBMT falls back on a probabilistic lexicon trained from the corpus to produce word-for-word translations. This fall-back approach provides some translation (even though of lower quality) for any word form encountered in the training data. While the translation quality of an EBMT system can be human-level when long phrases or entire sentences are matched, any mistakes in the human translations used for training (spelling errors, omissions, mistranslations) will become visible in the EBMT system's output. Thus it is important that the training data be as accurate as possible. The training corpus we use for EBMT is the spoken-language corpus described earlier; as discussed in Section 2.2.1, the corpus of spoken Mapudungun contains some errors and awkward translations.


Highly agglutinative languages like Mapudungun pose a challenge for example-based MT. Because there are so many inflected versions of each stem, most inflected words are rare. If the rare words do not occur in the corpus at all, they will not be translatable by EBMT; if they occur only a few times, it will also be hard for EBMT to have accurate statistics about how they are used. Additionally, automatic word-level alignment between the two languages, which is used to determine the appropriate translation of a partial match, becomes more difficult when individual words in one language correspond to many words in the other language. We address both issues by using the morphological analyzer for Mapudungun to split words into individual stems and suffixes. Alone, each stem and suffix is more common than the original combination of stem and suffixes, and the separate morphemes are more likely to map to single words in Spanish.
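This preprocessing step is easy to picture. The sketch below splits each surface word into its stem and suffix morphemes before EBMT training and matching, re-using an analyzer of the kind sketched in Section 2.2.1.1; the function name and the morpheme-marking convention are illustrative, not the project's actual code.

# Illustrative preprocessing: replace each surface word by its stem and
# suffixes (suffixes marked with "-") before EBMT training and matching.
# Assumes an analyzer like the sketch in Section 2.2.1.1.

def split_for_ebmt(sentence, analyze):
    tokens = []
    for word in sentence.split():
        analyses = analyze(word)
        if analyses:                         # take the first analysis; Mapudungun
            stem, suffixes, _ = analyses[0]  # morphology is rarely ambiguous
            tokens.append(stem)
            tokens.extend("-" + s for s in suffixes)
        else:
            tokens.append(word)              # unanalyzable words pass through
    return tokens

# "pefin" becomes ["pe", "-fi", "-n"]: each morpheme is far more frequent in
# the training corpus than the full inflected form, so EBMT finds more and
# longer matches, and morphemes align better to single Spanish words.
print(split_for_ebmt("pefin", analyze))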

We currently have a prototype EBMT system trained on approximately 204,000 sentence pairs from the corpus of spoken Mapudungun, containing a total of 1.25 million Mapudungun tokens after splitting the original words into stems and one or more suffixes. This separation increased BLEU scores (Papineni et al., 2001) on a held-out portion of the speech corpus by 5.48%, from 0.1530 to 0.1614. We expect further improvements in MT accuracy from more detailed use of morphological analysis, from the inclusion of common phrases in the corpus, and from repairing translation errors and awkward translations in the corpus.

2.2.2.2 A Rule-Based MT System

The second MT system the AVENUE team has developed is a rule-based system where the rules have been built by hand. Manually developing a rule-based MT (RBMT) system involves detailed comparative analysis of the grammars of the source and target languages. Manually imbuing an MT system with detailed syntactic knowledge can produce high-quality translation, but typically takes a significant amount of time to implement. Despite the time requirements, developing a rule-based MT system for Mapudungun by hand is a natural fit in our case, for two reasons. First, the AVENUE project has the necessary human expertise available to hand-develop an MT system with significant syntactic understanding. And second, hand-built rule-based MT nicely complements the weaknesses of EBMT: EBMT focuses on high translation coverage at the expense of accuracy, whereas the rule-based MT system does not exhaustively cover Mapudungun grammar, but the rules we wrote cover a portion of Mapudungun with high accuracy.

At a high level, the AVENUE rule-based MT framework is composed of three modules: a source language morphological analyzer, the syntactic rule-based transfer system itself, and a target language morphological generator. The source language morphological analyzer processes the raw input source sentence into a lattice of all possible morphological segmentations. The transfer system maps the morphologically analyzed source sentences into target language expressions via a transfer lexicon and grammar. The transfer grammar and lexicon contain syntactic and lexical rules that explicitly refer to morphosyntactic features of the two languages (more details are given in the next subsections). The output of the transfer system is a target language expression composed of uninflected words bundled with grammatical features. It is then the morphological generator's job to massage the uninflected word-feature bundles into fully inflected target language output sentences.

In the Mapudungun-Spanish rule-based MT system, the source language morphological analyzer is the Mapudungun morphological analyzer described earlier in Section 2.2.1.1. The Mapudungun-to-Spanish transfer grammar and lexicon are described in Section 2.2.2.4, while the Spanish morphological generator is described in Section 2.2.2.5. But before delving into details of the Mapudungun-Spanish rule-based MT system, a short overview of the AVENUE rule-based MT framework is needed.

2.2.2.3 Overview of Transfer Rule Syntax and the Run-time Transfer System

This paper does not describe in detail the rationale behind either the transfer rules used in the AVENUE rule-based MT system or the run-time MT transfer engine

Figure: A simplified hand-written AVENUE-style transfer rule. This rule transfers Mapudungun nominal expressions that mark for number into Spanish nominal expressions. Where Spanish marks number morphologically on the head noun, Mapudungun uses a separate particle to mark number on nouns.

{NBar,1}                            ; 1) rule identifier
NBar::NBar : [PART N] -> [N]        ; 2) SL and TL constituent structure specification
(
 (X2::Y1)                           ; 3) SL (X2) to TL (Y1) constituent alignment
 ((X1 number) =c pl)                ; 4) SL constraint: rule applies only if SL number is pl
 ((X0 number) = (X1 number))        ; 4) SL constraint: SL number is passed up to the SL NBar
 ((Y0 number) = (Y1 number))        ; 5) TL constraint: TL number is passed up to the TL NBar
 ((Y0 gender) = (Y1 gender))        ; 5) TL constraint: TL gender is passed up to the TL NBar
 ((Y0 number) = (X0 number))        ; 6) transfer equation: the TL NBar number must equal the SL NBar number
)

used to interpret them. This section does, however, provide a basic understanding of syntactic rules and their interpretation within rule-based MT in the AVENUE framework. For more specifics on the AVENUE rule-based MT formalism and the run-time transfer engine, please see Peterson (2002) and Probst (2005).

2.2.2.3.1 The AVENUE Transfer Rule Formalism

In rule-based MT, transfer rules decompose the grammatical information contained in a source language expression into a set of grammatical properties, such as number, person, tense, and lexical meaning. Then each particular rule builds an equivalent target language expression, copying, modifying, or rearranging grammatical values according to the requirements of the target language syntax and the target language lexical repertoire. In the AVENUE system, translation rules have six components (see the figure above): 1) a rule identifier, which consists of a constituent type (Sentence, Noun Phrase, Verb Phrase, etc.) and a unique ID number; 2) a constituent structure specification for both the source language (SL) and the target language (TL); 3) alignments between the SL constituents and the TL constituents; 4) SL constraints, which provide information about features and their values in the SL sentence; 5) TL constraints, which provide information about features and their values in the TL sentence; and 6) transfer equations, which provide information about which feature values transfer from the source into the target language.

The figure above illustrates these six components using a portion of a rule intended for a Mapudungun-to-Spanish rule-based transfer system. In Mapudungun, plurality in nouns is marked in some cases by the pronominal particle pu; the transfer rule shown transfers the plural feature from pu in Mapudungun onto Spanish nouns. In this transfer grammar, NBar is the constituent that dominates the noun and its modifiers but not its determiners. According to this rule, the Mapudungun sequence PART N will be transferred into a noun in Spanish. The constituent alignment aligns the Mapudungun noun (X2) with the Spanish noun (Y1). The SL is checked to ensure the application of the NBar,1 rule only when the Mapudungun particle specifies plural number; if the Mapudungun noun is preceded by any other particle, the rule fails to apply. The number feature is passed up from the particle to the Mapudungun NBar. At the same time, the number that is marked on the Spanish noun is passed up to the Spanish NBar. The gender feature, present only in Spanish, is passed up from the Spanish noun to the Spanish NBar. And finally, a transfer equation ensures that the number on the Spanish NBar equals the number on the Mapudungun NBar. This process can also be represented graphically as a tree structure.
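To make these constraint mechanics concrete, the sketch below paraphrases the NBar,1 rule as operations on plain feature dictionaries. The real engine performs unification over feature structures, and the lexical transfer of the aligned noun (X2::Y1) is omitted; this is an illustration, not the AVENUE implementation.

# The NBar,1 rule paraphrased as checks and copies over feature dictionaries.
# X1 is the Mapudungun particle and Y1 the Spanish noun; X0/Y0 are the source
# and target NBar parents being built. Lexical transfer of the aligned SL
# noun (X2) is omitted in this sketch.

def apply_nbar1(x1_particle, y1_noun):
    if x1_particle.get("number") != "pl":    # ((X1 number) =c pl): SL constraint
        return None                          # constraint fails: rule does not apply
    x0 = {"number": x1_particle["number"]}   # ((X0 number) = (X1 number))
    y0 = {"number": y1_noun["number"],       # ((Y0 number) = (Y1 number))
          "gender": y1_noun["gender"]}       # ((Y0 gender) = (Y1 gender))
    if y0["number"] != x0["number"]:         # ((Y0 number) = (X0 number)), treated
        return None                          # here as an agreement check
    return x0, y0

# 'pu ruka' -> 'casas': pu marks plural on the Mapudungun side, and the
# plural, feminine Spanish noun satisfies the transfer equation.
print(apply_nbar1({"number": "pl"}, {"number": "pl", "gender": "fem"}))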

We make use of a new framework (Probst et al., 2002) for rapid prototyping of MT systems for languages with limited amounts of electronically available linguistic resources and corpora, which includes a declarative formalism for symbolic transfer grammars. A grammar consists of a collection of transfer rules, which specify how phrase structures in a source language correspond and transfer to phrase structures in a target language, and the constraints under which these rules should apply. The framework also includes a fully implemented transfer engine that applies the transfer grammar to a source language input sentence at runtime and produces all possible word- and phrase-level translations according to the grammar. This framework was designed to support research on methods for automatically acquiring transfer grammars from limited amounts of elicited word-aligned data. The framework also supports manual development of transfer grammars by experts familiar with the two languages. While the framework itself is still research work in progress, it is sufficiently well developed for serious experimentation.


2.2.2.3.2 The Runtime Transfer Engine

At run time, the transfer engine interprets individual transfer rules like the one shown above to convert a morphologically analyzed source language sentence into a target language expression. The output of the run-time system is a lattice of translation alternatives that have yet to be morphologically inflected in the target language. The transfer engine produces a lattice of alternatives, as opposed to a single target language expression, because there may be syntactic and lexical ambiguity in the transfer rules and lexicon: multiple translation choices for source language lexical items in the dictionary and multiple competing transfer rules can each increase the number of possible translations (see the next section).

The run-time translation system incorporates into a single module all three of the traditional steps of syntactic transfer-based MT: 1) syntactic parsing of the source language input; 2) transfer of the parsed constituents of the source language to their corresponding structured constituents on the target language side; and 3) generation of the target language expressions from the target language constituents. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system.

In the first stage, source language (SL) parsing is performed based solely on the SL side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in Allen (1995). A chart is populated with all the constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A target language (TL) chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules provide individual lexical choices for the TL word-level entries in the TL chart. The set of generated TL output expressions that corresponds to the collection of all TL chart entries is then collected into a TL lattice. Finally, the TL lattice is passed to a statistical decoder, which ranks the possible paths through the lattice of translation alternatives. A more detailed description of the runtime transfer-based translation sub-system can be found in Peterson (2002).

The overall methodology thus combines two separate modules: the transfer engine, which produces the lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to a target language model. This general framework has been under development by the MT group at Carnegie Mellon under the AVENUE project (Probst et al., 2002) and was previously used for rapid prototyping of an MT system for Hindi-to-English translation (Lavie et al., 2003).

2.2.2.3.3 Decoding

In the final stage, a decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input, and these translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of the target language string given the source language string. The decoder is designed to use a probability model that calculates this probability as a product of two factors: a translation model for the translation units and a language model for the target language. The translation model is normally based on a probabilistic version of the translation lexicon, in which probabilities are associated with the different possible translations for each source language word. Such a probability model can be acquired automatically, using a reasonable-sized parallel corpus to train a word-to-word probability model based on automatic word alignment algorithms. This was not possible for Hebrew-to-English, since we did not have a suitable parallel corpus at our disposal. The decoder for the Hebrew-to-English system therefore uses the English language model as the sole source of information for selecting among alternative


translation segments. We use an English trigram language model trained on 160 million words.

The decoding search algorithm considers all possible sequences in the lattice and calculates the language model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of reordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
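As a rough illustration of this search, the sketch below walks a lattice of translation units left to right, keeping the best-scoring hypothesis at each input position. The lm_logprob function stands in for the trigram language model, the unit spans are assumed inputs, keeping a single hypothesis per position is a simplification of the decoder's actual search, and the limited reordering is omitted.

# Sketch of the decoder's search: choose a sequence of adjoining,
# non-overlapping translation units covering the whole input, scored by a
# target language model. `units` holds (start, end, target_words) triples
# over input positions [start, end); `lm_logprob` stands in for the LM.

def best_translation(units, n_words, lm_logprob):
    # best[pos] = (score, target words) for the best hypothesis covering
    # input positions 0..pos; keeping one hypothesis per position is a
    # simplification of the real search.
    best = {0: (0.0, [])}
    for pos in range(n_words):
        if pos not in best:
            continue                   # no hypothesis reaches this position
        _, words = best[pos]
        for start, end, target in units:
            if start == pos:           # unit adjoins the covered prefix
                candidate = words + list(target)
                score = lm_logprob(candidate)
                if end not in best or score > best[end][0]:
                    best[end] = (score, candidate)
    return best.get(n_words, (float("-inf"), None))[1]

A full decoder would also fold in translation-model probabilities when they are available, as described above.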

2.2.2.4 Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the Mapudungun-Spanish manually written rule-based MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, we did not neglect the available computational resources for Mapudungun: we developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus. Similarly, a version of the elicitation corpus translated into Mapudungun served as a development suite to guide us on which syntactic transfer rules were most vital to write.

The next three subsections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages; manually writing transfer rules forced us to consciously overcome them. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. And the third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor of the kind of transfer rules that are needed to write a transfer grammar between Mapudungun and Spanish, two completely unrelated languages; a more in-depth discussion of the transfer grammar and of translation mismatches between Mapudungun and Spanish is given in a forthcoming companion paper.

2.2.2.4.1 Suffix Agglutination

As discussed in Section 2.1, Mapudungun is an agglutinative language, with often four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled VSuffG,1 in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second VSuffG rule, labeled VSuffG,2 in Figure 6, recursively combines an additional VSuff with a VSuffG, passing up the feature structures of both children to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language, Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to, and generation of, Spanish expressions.

These two VSuffG rules apply, for example, to combine the two verbal suffixes -fi and -ñ of example (1) into a single VSuffG. First the rule VSuffG,1 is applied to -fi to produce a VSuffG marked with third person object agreement. Next the rule VSuffG,2 recursively combines -ñ with the VSuffG containing -fi, producing a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verbal Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1) pe -fi -ñ
    see -3.obj -1sg.subj.indic
    'I saw him/her/it/them'

{VSuffG,1}
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

{VSuffG,2}
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
 (X0 = X2))

Figure 6: Two recursive Verbal Suffix Group rules.



2.2.2.4.2 Tense

Tense in Mapudungun is often morphologically unmarked. The temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun; consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.
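Read as a decision procedure, the tense rules of Figure 7 amount to a short cascade. The sketch below encodes that cascade; the feature values are illustrative labels rather than the grammar's actual symbols.

# The tense-interpretation rules of Figure 7 as a cascade of checks, applied
# to features gathered from the verb and its suffix group. Feature values
# are illustrative labels.

def temporal_interpretation(tense, aspect, lexical_aspect):
    if tense == "future":            # morphologically marked future
        return "future"
    if lexical_aspect == "stative":  # unmarked tense on a stative verb: present
        return "present"
    if aspect == "habitual":         # habitual aspect also yields present
        return "present"
    return "past"                    # everything unmarked defaults to past

print(temporal_interpretation("unmarked", "unmarked", "unmarked"))  # past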

2.2.2.4.3 Typological Divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are expressed using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation, but the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see example (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle; see example (3).

(2) Mapudungun: kofke-tu-n
        bread-verbalizer-1sg.subj.indic
    Spanish: com-o pan
        eat-1sg.pres.indic bread
    English: 'I eat bread'

(3) Mapudungun: pe-nge-n
        see-passive-1sg.subj.indic
    Spanish: fui visto
        aux.1sg.past.indic see.pass.part
    English: 'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side. The rule for passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary

Figure 7: Abstract rules for the interpretation of tense in Mapudungun.

| Tense    | Aspect   | Lexical aspect | Temporal interpretation | Example                                           |
|----------|----------|----------------|-------------------------|---------------------------------------------------|
| Future   | Any      | Any            | Future                  | pe-a-n (see-future-1sg.subj.indic) 'I will see'   |
| Unmarked | Any      | Stative        | Present                 | niye-n (own-1sg.subj.indic) 'I own'               |
| Unmarked | Habitual | Unmarked       | Present                 | kellu-ke-n (help-habitual-1sg.subj.indic) 'I help'|
| Unmarked | Unmarked | Unmarked       | Past                    | kellu-n (help-1sg.subj.indic) 'I helped'          |

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
(
 (X1::Y1)                           ; alignment
 ((X2 tense) = UNDEFINED)           ; SL constraint on morphological tense
 ((X1 lexicalaspect) = UNDEFINED)   ; SL constraint on the verb's aspectual class
 ((X2 aspect) = (NOT habitual))     ; SL constraint on grammatical aspect
 ((X0 tense) = past)                ; SL tense feature assignment
 ...
)

Figure 8: Past tense rule (transfer of features omitted).


the correct grammatical features, and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

2.2.2.5 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon need not contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona under a research license (see http://www.lsi.upc.es/~nlp/freeling/parole-es.html). Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that inflected form marks. We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute and value pairs of the format that our MT system expects. This way, as the run-time transfer engine needs a particular inflected form of a Spanish word, the transfer engine can pass the uninflected form to the Spanish morphology generator and have it generate the appropriate surface inflected forms.
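Conceptually, generation is then a table lookup keyed on the citation form plus the requested feature bundle. The sketch below illustrates the idea; the entry format, feature names, and the tiny dictionary are invented for illustration and are not the UPC resource's actual format.

# Sketch of morphological generation as a dictionary lookup: the inflected
# dictionary is indexed by citation form plus the grammatical features each
# inflected form marks. Entries and feature names are illustrative.

INFLECTED = [
    # (citation form, inflected form, features)
    ("cantar", "canto",    {"person": "1", "number": "sg", "tense": "pres", "mood": "indic"}),
    ("cantar", "cantamos", {"person": "1", "number": "pl", "tense": "pres", "mood": "indic"}),
]

# Index once: (lemma, frozen feature set) -> surface form.
GENERATION_TABLE = {
    (lemma, frozenset(feats.items())): form
    for lemma, form, feats in INFLECTED
}

def generate(lemma, features):
    """Inflect a Spanish stem for the features requested by the transfer engine."""
    return GENERATION_TABLE.get((lemma, frozenset(features.items())), lemma)

print(generate("cantar", {"person": "1", "number": "pl",
                          "tense": "pres", "mood": "indic"}))  # cantamos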

When the Spanish morphological generation module is integrated with the run-time transfer system, the full rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen ('I am fine')
tl: ESTOY BIEN
tree: <((S,5 (VPBAR,2 (VP,1 (VBAR,9 (V,10 ESTOY) (V,15 BIEN))))))>

sl: ayudangelay ('he/she was not helped')
tl: NO FUE AYUDADA
tree: <((S,5 (VPBAR,2 (VP,2 (NEGP,1 (LITERAL NO) (VBAR,3 (V,11 FUE) (V,8 AYUDADA)))))))>
tl: NO FUE AYUDADO
tree: <((S,5 (VPBAR,2 (VP,2 (NEGP,1 (LITERAL NO) (VBAR,3 (V,11 FUE) (V,8 AYUDADO)))))))>

sl: Kuan ñi ruka ('John's house')
tl: LA CASA DE JUAN
tree: <((S,12 (NP,7 (DET,10 LA) (NBAR,2 (N,8 CASA)) (LITERAL DE) (NP,4 (NBAR,2 (N,1 JUAN))))))>

2.2.3 Evaluation

[To be completed. Planned content: a hand (human) evaluation; BLEU, NIST, and METEOR scores; an evaluation of the EBMT system on the same test data; statistical significance tests of the results; and an automatic evaluation of the word-to-word translator.]

2.3 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established a collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals for building a manual RBMT system for this language pair was to test the effects of an automatic rule refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing

{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]    ; insertion of the auxiliary on the Spanish side
(
 (X1::Y2)                           ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
 ((X2 voice) =c passive)            ; Mapudungun voice constraint
 ((Y1 person) = (Y0 person))        ; passing person features to the Spanish auxiliary
 ((Y1 number) = (Y0 number))        ; passing number features to the Spanish auxiliary
 ((Y1 mood) = (Y0 mood))            ; passing mood features to the Spanish auxiliary
 ((Y2 number) =c (Y1 number))       ; Spanish-side agreement constraint
 ((Y1 tense) = past)                ; assigning the tense feature to the Spanish auxiliary
 ((Y1 form) =c ser)                 ; Spanish auxiliary selection
 ((Y2 mood) = part)                 ; Spanish-side verb form constraint
 ...
)

Figure 9: Simplified passive-voice transfer rule. Where Spanish uses a syntactic construction to express passive voice, Mapudungun marks passive voice with a verbal suffix. Hence an auxiliary verb must be inserted on the Spanish side; the auxiliary is the first V constituent on the production side of the transfer rule. The rule then ensures that the person, number, mood, and tense features are marked on the Spanish auxiliary, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.


information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker and were later manually revised. Taking the elicitation corpus (Section 2.3.1) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned both by a native Quechua speaker and by a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough to use for development of MT. The EC resembles a fieldwork questionnaire, containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features such as number, person, tense, and gender; the version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books that had parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section 2.2.2.5).

2.3.2.1 Morphology and Translation Lexicons

In order to build translation and morphology lexicons, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books; only 3,002 word types were in more than one book.¹ Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique

1 This was done before the OCR correction was completed, and thus this list contained OCR errors.

Additionally, all the unique word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.
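The selection procedure just described is straightforward to script. The following is a minimal sketch of it, under stated assumptions: the file names are hypothetical, tokenization is plain whitespace splitting, and the actual pipeline may have differed in details.

from collections import Counter

# Hypothetical file names for the three scanned books (after OCR).
BOOKS = ["cuentos_cusquenos.txt", "cuentos_de_urubamba.txt",
         "gregorio_condori_mamani.txt"]

counts = Counter()                       # word type frequency over all books
for path in BOOKS:
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())  # naive whitespace tokenization

# Rank the word types by frequency and keep the 10,000 most frequent ones;
# singletons fall below the cutoff, which also filters out many OCR errors
# and misspellings.
to_annotate = [w for w, _ in counts.most_common(10000)]

# Add the unique word types of the Elicitation Corpus to ensure coverage.
with open("elicitation_corpus_quechua.txt", encoding="utf-8") as f:
    ec_types = {tok for line in f for tok in line.split()}
to_annotate.extend(sorted(ec_types - set(to_annotate)))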

The 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10.

1. Word Segmentation
2. Root Translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the Final Root

Figure 10: Fields required to automatically build the translation and morphology lexicons.

The native speaker was asked to provide the Translation of the Final Root (last field in Figure 10) only if there had been a part-of-speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old", and the word is a verb whose root really means "to get old" ("machuyay").2 Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it is actually translated differently in Spanish. For these cases, the native speaker was asked to use || instead of |, and the post-processing scripts were designed to check for the consistency of || in both the translation and the POS fields.
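A minimal sketch of this expansion step follows, assuming the annotation conventions just described ("|" separates interchangeable alternatives; "||" separates readings whose POS, and possibly translation, differ). The actual post-processing scripts are not reproduced in this paper, so function and variable names here are illustrative.

def expand_entries(root, translations, pos_tags):
    """Pair up ||-groups, then take the product of |-alternatives."""
    trans_groups = [g.strip() for g in translations.split("||")]
    pos_groups = [g.strip() for g in pos_tags.split("||")]
    if len(trans_groups) != len(pos_groups):
        raise ValueError("inconsistent use of || in translation vs. POS")
    entries = []
    for t_group, p_group in zip(trans_groups, pos_groups):
        for pos in (p.strip() for p in p_group.split("|")):
            for trans in (t.strip() for t in t_group.split("|")):
                entries.append((pos, root, trans))
    return entries

# The 'chayqa' row of Figure 11 (root 'chay') yields the six entries
# Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso:
for pos, root, trans in expand_entries("chay", "ese | esa | eso", "Pron | Adj"):
    print(f"{pos} | [{root}] -> [{trans}] ((X1::Y1))")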

2 -ya- is a verbalizer in Quechua


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word: chayqa
Segmentation: chay+qa
Root translation: ese | esa | eso
Root POS: Pron | Adj
Word translation: ese || es ese
Word POS: Pron || VP

Figure 11: Example of a segmented and translated word.

V | [ni] -> [decir] ((X1::Y1))
N | [pacha] -> [tiempo] ((X1::Y1))
N | [pacha] -> [tierra] ((X1::Y1))
Pron | [noqa] -> [yo] ((X1::Y1))
Interj | [alli] -> [a pesar] ((X1::Y1))
Adj | [hatun] -> [grande] ((X1::Y1))
Adj | [hatun] -> [alto] ((X1::Y1))
Adv | [kunan] -> [ahora] ((X1::Y1))
Adv | [allin] -> [bien] ((X1::Y1))
Adv | [ama] -> [no] ((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list.

Suff::Suff | [s] -> []
((X1::Y1)
 ((x0 type) = reportative))    ; "dicen que" on the Spanish side

Suff::Suff | [si] -> []
((X1::Y1)
 ((x0 type) = reportative))    ; used when following a consonant

Suff::Suff | [qa] -> []
((X1::Y1)
 ((x0 type) = emph))

Suff::Suff | [chu] -> []
((X1::Y1)
 ((x0 type) = interr))

VSuff::VSuff | [nki] -> []
((X1::Y1)
 ((x0 person) = 2)
 ((x0 number) = sg)
 ((x0 mood) = ind)
 ((x0 tense) = pres)
 ((x0 inflected) = +))

NSuff::NSuff | [kuna] -> []
((X1::Y1)
 ((x0 number) = pl))

NSuff::Prep | [manta] -> [de]
((X1::Y1)
 ((x0 form) = manta))

Figure 13: Manually written suffix lexical entries.

For the current working MT prototype, the suffix lexicon has 36 entries; Cusihuaman's grammar (2001) lists a total of 150 suffixes.

2.3.2.2. Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in section (see above).

2.3.3. Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and corrected nine machine translations where necessary. The correction of MT output is the first step towards automatic rule refinement, and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.


{S,2}
S::S : [NP VP] -> [NP VP]
((X1::Y1) (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender)))

{SBar,1}
SBar::SBar : [S] -> [Dice que S]
((X1::Y2)
 ((x1 type) =c reportative))

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
((X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood)))

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO)))))>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ)))))))>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO)))) (VP3 (VBAR2 (V35 SOY)) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA)))))))>

Figure 14 Manually written grammar rules for Quechua-Spanish translation

Figure 15: MT output from the Quechua-Spanish rule-based MT prototype.

2.4. Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,3 not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"), prepositions such as B ("in"), K ("as"), L ("to") and M ("from"), subordinating conjunctions such as $ ("that") and K$ ("when"), relativizers such as $ ("that"), and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus a form such as MHGR can be read as the lexeme "immigrant",

3 To facilitate readability we use a transliteration of Hebrew using ASCII characters in this paper

as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or of a prefixing particle.

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule based, relying on the AVENUE run-time transfer engine described in section to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor-effort equivalent of about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. The small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems we used existing publicly available resources, which we adapted in novel ways for the MT task, to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation segments that together represent the most likely translation of the entire input sentence.


We now describe each of the components in more detail.

2.5. Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
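The mapping itself reduces to a character table. The following is a minimal sketch, assuming the input arrives as cp1255 (Microsoft Windows Hebrew) bytes; the romanization table is a partial reconstruction consistent with the transliterations used in this paper (B$WRH, MHGR, $MLH), and the system's actual table may differ, for instance in its treatment of tet or of punctuation.

HEB2ASCII = {
    "\u05d0": "A", "\u05d1": "B", "\u05d2": "G", "\u05d3": "D",
    "\u05d4": "H", "\u05d5": "W", "\u05d6": "Z", "\u05d7": "X",
    "\u05d9": "I", "\u05db": "K", "\u05da": "K",  # final kaf folds to K
    "\u05dc": "L", "\u05de": "M", "\u05dd": "M", "\u05e0": "N",
    "\u05df": "N", "\u05e1": "S", "\u05e2": "E", "\u05e4": "P",
    "\u05e3": "P", "\u05e6": "C", "\u05e5": "C", "\u05e7": "Q",
    "\u05e8": "R", "\u05e9": "$", "\u05ea": "T",
}

def romanize(raw: bytes) -> str:
    """Map Windows-encoded Hebrew input to the romanized internal form.
    Characters outside the table (digits, punctuation) pass through."""
    return "".join(HEB2ASCII.get(ch, ch) for ch in raw.decode("cp1255"))

print(romanize("\u05d1\u05e9\u05d5\u05e8\u05d4".encode("cp1255")))  # -> B$WRH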

2.6. Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the Hebrew analyzer is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure, which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which will all be stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2 Lattice Representation of a set of Analyses for the Hebrew Word B$WRH

Y0: ((SPANSTART 0)      Y1: ((SPANSTART 0)     Y2: ((SPANSTART 1)
     (SPANEND 4)             (SPANEND 2)            (SPANEND 3)
     (LEX B$WRH)             (LEX B)                (LEX $WR)
     (POS N)                 (POS PREP))            (POS N)
     (GEN F)                                        (GEN M)
     (NUM S)                                        (NUM S)
     (STATUS ABSOLUTE))                             (STATUS ABSOLUTE))

Y3: ((SPANSTART 3)      Y4: ((SPANSTART 0)     Y5: ((SPANSTART 1)
     (SPANEND 4)             (SPANEND 1)            (SPANEND 2)
     (LEX $LH)               (LEX B)                (LEX H)
     (POS POSS))             (POS PREP))            (POS DET))

Y6: ((SPANSTART 2)      Y7: ((SPANSTART 0)
     (SPANEND 4)             (SPANEND 4)
     (LEX $WRH)              (LEX B$WRH)
     (POS N)                 (POS LEX))
     (GEN F)
     (NUM S)
     (STATUS ABSOLUTE))

Figure 3 Feature-Structure Representation of a set of Analyses for the Hebrew Word B$WRH
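As a sketch of one plausible encoding (the system's actual data structures are not shown in this paper), the analyses of Figures 2 and 3 can be stored as alternative lexeme sequences and merged, prefix by prefix, into a single lattice; features are abbreviated here.

# Each analysis is a sequence of (lexeme, features); the last entry is the
# whole form treated as an unknown word with no features.
ANALYSES = [
    [("B$WRH", ("N", "F", "S"))],
    [("B", ("PREP",)), ("$WRH", ("N", "F", "S"))],
    [("B", ("PREP",)), ("H", ("DET",)), ("$WRH", ("N", "F", "S"))],
    [("B", ("PREP",)), ("$WR", ("N", "M", "S")), ("H", ("POSS",))],
    [("B$WRH", ())],
]

def build_lattice(analyses):
    """Merge the alternative lexeme sequences into a prefix-sharing trie,
    so that the three B-initial analyses share a single initial arc."""
    root = {}
    for seq in analyses:
        node = root
        for arc in seq:
            node = node.setdefault(arc, {})
    return root

lattice = build_lattice(ANALYSES)
print(len(lattice))  # 3 arcs leave the start state: two readings of B$WRH, and B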

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice, rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine personal pronoun, or the (extremely rare) noun "spade".


We currently give all these readings the same weight, although we intend to rank them in the future.

2.7. Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.
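The coverage-extension step described above can be sketched as follows, assuming the two dictionary sections have been loaded into hypothetical mappings heb2eng and eng2heb from a lemma to its list of translations.

def combine_directions(heb2eng, eng2heb):
    """Use the Hebrew-English section directly and invert the
    English-Hebrew section to recover additional Hebrew entries."""
    lexicon = {heb: set(engs) for heb, engs in heb2eng.items()}
    for eng, hebs in eng2heb.items():
        for heb in hebs:
            lexicon.setdefault(heb, set()).add(eng)
    return lexicon

# Tiny illustration (entries invented for the example): B$WRH gains a
# translation found only in the English-Hebrew direction.
lex = combine_directions({"B$WRH": ["gospel"]}, {"tidings": ["B$WRH"]})
print(lex["B$WRH"])  # {'gospel', 'tidings'}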

2.8. English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular) and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
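A minimal sketch of this enhancement process follows, with a hypothetical inflect() helper standing in for whatever English inflection procedure was actually used; entries pair a Hebrew lexeme with an English form plus the unification constraints that select it at run-time.

def enhance(entries, inflect):
    """entries: (hebrew, english, pos, feats) lexical rules for base forms.
    Adds inflected-English variants, each carrying the constraint that
    must unify with the analyzed Hebrew word's features."""
    out = []
    for heb, eng, pos, feats in entries:
        out.append((heb, eng, pos, feats))
        if pos == "N":
            out.append((heb, inflect(eng, "pl"), pos, {**feats, "num": "pl"}))
        elif pos == "V":
            out.append((heb, inflect(eng, "past"), pos, {**feats, "tense": "past"}))
            out.append((heb, "will " + eng, pos, {**feats, "tense": "fut"}))
            out.append((heb, inflect(eng, "3sg"), pos,
                        {**feats, "tense": "pres", "per": 3, "num": "sg"}))
            out.append((heb, "to " + eng, pos, {**feats, "form": "inf"}))
    return out

# With a toy inflector, the noun entry (XTWL, 'cat') also yields 'cats',
# constrained to (num = pl):
toy = lambda w, tag: {"pl": w + "s", "past": w + "ed", "3sg": w + "s"}[tag]
print(enhance([("XTWL", "cat", "N", {})], toy)[1])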

2.9. Two Hebrew-English Transfer Grammars

{NP1,2}
SL: $MLH ADWMH
TL: A RED DRESS
Score: 2
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
SL: H $MLWT H ADWMWT
TL: THE RED DRESSES
Score: 4
NP1::NP1 [NP1 H ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Figure 4 NP Transfer Rules for Nouns Modified by Adjectives from Hebrew to English

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5: Select translated sentences from the development data.


System            BLEU                      NIST                      Precision  Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830     0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938     0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085     0.4241

Table 1 System Performance Results with the Various Grammars

2.10. Results and Evaluation

The current system was developed over the course of a two-month period and was targeted at translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also include aggregate unigram precision and unigram recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we applied a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual grammar system outperforms the no-grammar system according to all the metrics. The learned grammar system scores fall between the scores of the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All differences are highly statistically significant.
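The bootstrap procedure can be sketched as follows: per-sentence statistics are resampled with replacement and the corpus-level metric is recomputed on each resample (for BLEU this means re-summing n-gram counts rather than averaging per-sentence scores). The corpus_score argument is a stand-in for whichever metric is being assessed.

import random

def bootstrap_ci(sent_stats, corpus_score, n_resamples=1000, alpha=0.05):
    """Confidence interval for a corpus-level MT metric by resampling
    test sentences with replacement (Efron and Tibshirani, 1986)."""
    n = len(sent_stats)
    scores = []
    for _ in range(n_resamples):
        sample = [random.choice(sent_stats) for _ in range(n)]
        scores.append(corpus_score(sample))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi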


2.11. Hindi

Little enough natural language processing work had been done on Hindi by 2003 for DARPA to designate Hindi as the language for which to build MT systems in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than that of the current automatically inferred rules. Furthermore, a multi-engine version of our system, which combined the output of the Xfer and SMT systems and optimized translation selection, outperformed both individual systems.

1 INTRODUCTION

Corpus-based machine translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years and have significantly improved the state of the art of machine translation for a number of different language pairs.


These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that the Hindi-to-English task was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for the development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]: our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution and results of this experiment are the focus of the remainder of this section. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than with the current automatically inferred rules. Furthermore, a "multi-engine" version of our system, which combined the output of the Xfer and SMT systems and optimized translation selection, outperformed both individual systems.

The remainder of this discussion is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig 1 Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

21 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e. English) into their native language (i.e. Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation

principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

22 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows the SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

;; PASSIVE SIMPLE PRESENT
VP::VP : [V V Aux] -> [Aux being V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Fig 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent"; literally: sent going present-tense-auxiliary) in a sentence such as Ab tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail"; literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat", in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be re-written more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free; they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
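The first two stages can be summarized with the following simplified sketch: seed_generation reads a flat rule off one word-aligned example, and compositionality replaces a known NP part-of-speech sequence by the NP non-terminal. The real algorithms (and Seeded Version Space Learning, omitted here) are considerably more involved; see Probst et al. (2003).

def seed_generation(sl, tl, alignments, typ="S"):
    """sl/tl: lists of (word, POS); alignments: 1-based (x, y) pairs.
    Returns a flat rule: POS sequences plus the example's alignments."""
    return {"type": typ, "x": [p for _, p in sl], "y": [p for _, p in tl],
            "align": list(alignments), "constraints": []}

def compositionality(rule, np_xsides):
    """Replace any learned NP x-side occurring as a subsequence by 'NP'."""
    xs = list(rule["x"])
    for sub in np_xsides:
        n = len(sub)
        for i in range(len(xs) - n + 1):
            if xs[i:i + n] == sub:
                xs[i:i + n] = ["NP"]   # alignments would be re-indexed here
                break
    return {**rule, "x": xs}

# 'Det N P' generalizes to 'NP P' once an NP rule with x-side [Det, N]
# has been learned, as in the Hindi PP example above:
flat = {"type": "PP", "x": ["Det", "N", "P"], "y": ["P", "NP"],
        "align": [(3, 1)], "constraints": []}
print(compositionality(flat, [["Det", "N"]])["x"])  # ['NP', 'P']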

23 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The run-time translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up chart parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the run-time transfer-based translation sub-system can be found in [Peterson 2002].
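The integrated transfer-and-generation pass can be sketched as follows for a single parsed constituent, ignoring constraint unification and chart bookkeeping: aligned y-side slots recurse into the corresponding child, unaligned slots emit literal TL words (such as inserted auxiliaries), and word-level choices come from the lexical transfer rules. Names and the tree encoding here are illustrative, not the system's actual data structures.

def generate(node, lexicon):
    """node: SL word (leaf) or (y_side, alignments, children) for a phrase.
    Returns the list of alternative TL strings (the 'lattice') for node."""
    if isinstance(node, str):                      # lexical transfer
        return [[w] for w in lexicon.get(node, [node])]
    y_side, alignments, children = node
    outs = [[]]
    for y_pos, y_item in enumerate(y_side, start=1):
        xs = [x for x, y in alignments if y == y_pos]
        alts = (generate(children[xs[0] - 1], lexicon) if xs
                else [[y_item]])                   # literal insertion
        outs = [o + a for o in outs for a in alts] # cross the alternatives
    return outs

# Hebrew NP1 rule of Figure 4: [NP1 ADJ] -> [ADJ NP1], alignments
# (X2::Y1) (X1::Y2); two translations of $MLH yield a two-path lattice.
rule = (["ADJ", "NP1"], [(2, 1), (1, 2)], ["$MLH", "ADWMH"])
print(generate(rule, {"$MLH": ["dress", "gown"], "ADWMH": ["red"]}))
# [['red', 'dress'], ['red', 'gown']]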

24 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability $p(e \mid f)$, where $f = f_1 \ldots f_J$ is the source sentence and $e = e_1 \ldots e_I$ is the sequence of target language words. According to Bayes' decision rule, we have to search for

$$\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e) \qquad (1)$$

The language model $p(e)$ describes how well-formed a target language sentence $e$ is. We use a standard trigram model:

$$p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1}) \qquad (2)$$

The translation probability $p(f \mid e)$ for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

$$p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i) \qquad (3)$$

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence $s$ of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

$$p(f \mid e) = \prod_{s} p(\tilde{f}_s \mid \tilde{e}_s) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i}) \qquad (4)$$

The search algorithm considers all possible sequences $s$ in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence, i.e. leaving a gap, translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution, i.e. longer movements are less likely than shorter ones, with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search, i.e. partial translation hypotheses which are low scoring compared to the best scoring hypothesis up to that point are pruned from further consideration.
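A simplified sketch of the lattice search, without reordering or beam pruning, follows: a dynamic program over source positions keeps, for each covered prefix, the best-scoring hypothesis (log-probability, trigram history, output words). lm_logprob is a hypothetical trigram language model interface, and the unit scores tm stand for the log translation probabilities of equation (3).

def decode(units, lm_logprob, J):
    """units: (start, end, tl_words, tm_logprob) over source span [start, end).
    Selects adjoining, non-overlapping units covering positions 0..J that
    maximize translation-model times language-model probability."""
    best = {0: (0.0, ("<s>", "<s>"), [])}
    for j in range(J):
        if j not in best:
            continue
        score, hist, words = best[j]
        for start, end, tl, tm in units:
            if start != j:
                continue
            sc, h = score + tm, hist
            for w in tl:                     # extend the trigram history
                sc += lm_logprob(h, w)
                h = (h[1], w)
            if end not in best or sc > best[end][0]:
                best[end] = (sc, h, words + list(tl))
    return best.get(J, (float("-inf"), None, []))[2]

# Toy run with a uniform LM over two single-word units:
uniform = lambda hist, w: -1.0
print(decode([(0, 1, ["letters"], -0.5), (1, 2, ["arrive"], -0.7)],
             uniform, 2))  # ['letters', 'arrive']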

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus and a set of phrases from the Brown Corpus ([Fra ]) that we extracted from the Penn Treebank ([Tre ]).

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a

compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g. names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our

trained Xfer system performs translation in the opposite direction This has both advan-tages and disadvantages The main advan-tage was our ability to rely on a extensive re-sources available for English such as tree-banks The main disadvantage was that typing in Hindi was not very natural even for the na-tive speakers of the language resulting in some level of typing errors This however did not pose a major problem because the ex-tracted rules are mostly generalized to the part-of-speech level Furthermore since the runtime translation direction is from Hindi to English rules that include incorrect Hindi spelling will not match during translation but will not cause incorrect translation

4. HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description           Morpher          Xfer
past participle       (tam = yA)       (aspect = perf) (form = part)
present participle    (tam = wA)       (aspect = imperf) (form = part)
infinitive            (tam = nA)       (form = inf)
future                (tam = future)   (tense = fut)
subjunctive           (tam = subj)     (tense = subj)
root                  (tam = 0)        (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
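As a concrete illustration of this mapping, the following is a minimal Python sketch (with hypothetical function and dictionary names; the actual wrapper code is not shown in this paper) of how raw Morpher tense/aspect/mood features could be rewritten into the Xfer feature system of Table I:

    # Mapping from IIIT Morpher tense/aspect/mood values (Table I)
    # to the feature constraints used by the Xfer grammar formalism.
    TAM_MAP = {
        "yA": {"aspect": "perf", "form": "part"},    # past participle
        "wA": {"aspect": "imperf", "form": "part"},  # present participle
        "nA": {"form": "inf"},                       # infinitive
        "future": {"tense": "fut"},
        "subj": {"tense": "subj"},
        "0": {"form": "root"},
    }

    def convert_morpher_features(morpher_output):
        """Rewrite a dict of raw Morpher features into Xfer features.
        Features other than 'tam' (e.g. gender, number) pass through."""
        xfer_features = {}
        for name, value in morpher_output.items():
            if name == "tam":
                xfer_features.update(TAM_MAP.get(value, {}))
            else:
                xfer_features[name] = value
        return xfer_features

    # Example: a perfective-participle reading of a verb
    print(convert_morpher_features({"root": "raha", "tam": "yA", "num": "sg"}))
    # -> {'root': 'raha', 'aspect': 'perf', 'form': 'part', 'num': 'sg'}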

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequence rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion:
(jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion:
(one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules
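To make the rule notation concrete, the following is a minimal Python sketch of the constituent reordering that the alignment pairs encode, using rule {PP,12} as an example (simplified data structures; the real transfer engine also unifies feature constraints and operates over a lattice):

    # The x-side alignment indices map source constituents to
    # target positions; rule application permutes the translated
    # constituents accordingly.
    def apply_alignments(source_constituents, alignments, n_target):
        """Place translated source constituents into target slots.
        alignments is a list of (x_index, y_index) pairs, 1-based."""
        target = [None] * n_target
        for x, y in alignments:
            target[y - 1] = source_constituents[x - 1]
        return target

    # Hindi 'jIvana ke' (life of): NP -> 'life', Postp -> 'of'
    print(apply_alignments(["life", "of"], [(1, 2), (2, 1)], 2))
    # -> ['of', 'life'], i.e. the English order [Prep NP]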

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2 and in greater detail in [Probst et al 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
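The frequency-based scoring and pruning just described can be sketched as follows (Python; the 1% threshold and the flat data layout are illustrative assumptions, not the values used in the experiment):

    from collections import Counter

    def prune_rules(rule_instances, min_prob=0.01):
        """rule_instances: one rule identifier per training example
        that produced it. Rules whose relative frequency falls below
        min_prob are discarded as likely spurious."""
        counts = Counter(rule_instances)
        total = sum(counts.values())
        return {rule: n / total for rule, n in counts.items()
                if n / total >= min_prob}

    # E.g. a rule produced by 1 of 1,000 training examples is pruned,
    # while a rule produced by 50 of 1,000 examples is kept.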

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words and leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed. (A sketch of this matching process follows the list of manually written rules below.)

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add "es" instead of a simple "s" for plurals, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource of high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of postposition words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
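As promised in Section 4.2, here is a minimal sketch of the lexical matching-and-unification step (Python; flat feature dictionaries and an illustrative lexicon entry stand in for the engine's full feature structures):

    def unifiable(input_features, rule_constraints):
        """A flat stand-in for unification: every constraint in the
        lexical rule must be consistent with the analyzed features."""
        return all(input_features.get(k, v) == v
                   for k, v in rule_constraints.items())

    def candidate_translations(root, features, lexicon):
        """lexicon maps a Hindi root to a list of
        (english_form, constraints) lexical transfer rules."""
        return [english for english, constraints in lexicon.get(root, [])
                if unifiable(features, constraints)]

    # Illustrative entry: a root with singular and plural English forms.
    lexicon = {"kitAba": [("book", {"num": "sg"}),
                          ("books", {"num": "pl"})]}
    # A plural Hindi noun analysis keeps only the plural English entry:
    print(candidate_translations("kitAba", {"num": "pl"}, lexicon))
    # -> ['books']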

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
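Schematically, the three passes can be pictured as follows (a toy Python sketch with simplified dictionary-based lexicons; grammar rule application in the second pass is omitted):

    def build_lattice(sentence, phrase_lexicon, word_lexicon, morpher_root):
        """Toy three-pass lattice population. Arcs are
        (start_index, end_index, english) triples over the input words."""
        arcs = []
        n = len(sentence)
        # Pass 1: full-form phrasal matches, no grammar rules.
        for i in range(n):
            for j in range(i + 2, n + 1):  # phrasal = 2+ Hindi words
                phrase = " ".join(sentence[i:j])
                if phrase in phrase_lexicon:
                    arcs.append((i, j, phrase_lexicon[phrase]))
        # Pass 2: root forms via morphology feed the grammar rules
        # (grammar application itself is omitted in this sketch).
        for i, word in enumerate(sentence):
            root = morpher_root.get(word, word)
            if root in word_lexicon:
                arcs.append((i, i + 1, word_lexicon[root]))
        # Pass 3: original inflected forms, word-to-word, no grammar.
        for i, word in enumerate(sentence):
            if word in word_lexicon:
                arcs.append((i, i + 1, word_lexicon[word]))
        return arcs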

5. THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                  BLEU                 NIST
EBMT                    0.058                4.22
SMT                     0.102 (+/- 0.016)    4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)    5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)    5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)    5.59 (+/- 0.20)
MEMT (Xfer + SMT)       0.136 (+/- 0.018)    5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

[Figure: NIST scores as a function of the decoder reordering window (0-4), for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig. 9. Results by NIST score

[Figure: BLEU scores as a function of the decoder reordering window (0-4), for the same five systems]

Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology);

(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3;

(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
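This resampling procedure is a standard bootstrap over the test set; a minimal sketch (Python, with a generic scoring callback standing in for corpus-level BLEU or NIST):

    import random

    def bootstrap_interval(test_set, score_fn, draws=10000, alpha=0.05):
        """Draw |test_set| sentences with replacement, score each
        resampled set, and report the median and the 95% interval."""
        scores = []
        for _ in range(draws):
            sample = [random.choice(test_set) for _ in test_set]
            scores.append(score_fn(sample))
        scores.sort()
        lo = scores[int(len(scores) * alpha / 2)]
        hi = scores[int(len(scores) * (1 - alpha / 2))]
        return scores[len(scores) // 2], lo, hi

    # score_fn would compute corpus-level BLEU or NIST over the
    # (hypothesis, references) pairs in the resampled set.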

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different reordering windows in the decoder on the different systems. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6. CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system in its current state outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16(2), 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19(2), 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28(1), 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language resource. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool, to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translation produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.


4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullán Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James. (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramática Quechua, Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine translation systems for indigenous languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish user studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Mitchell, Marcus, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit.


how these constraints led us to build particular MT systems. Table 2 summarizes the available resources for each low-resource language. At a glance, with the possible exception of Hindi, the AVENUE project had better access to human resources than to computational resources.

The first column in Table 2, which notes the availability of computational linguists with knowledge of machine translation techniques, is prerequisite to developing MT systems for any language. The third column in Table 2 is also fully filled in, because the AVENUE project had access to bilingual speakers for each of our translation pairs.

On the other hand, for any particular resource-poor language there are few computational linguists with detailed knowledge of that language. The AVENUE project elected to build a Hebrew MT system specifically because of project-internal expertise on Hebrew. Similarly, we chose to work with the South American indigenous languages Mapudungun and Quechua only after it became clear we would have some language-specific expertise on each. The AVENUE project specifically cultivated relationships with government, civic, linguistic, and educational organizations of the native speaking populations of several Native American languages. It is through these contacts that the AVENUE project gained computational linguistic expertise in Mapudungun and Quechua. And after the AVENUE project had already begun work on Mapudungun, a computational linguist with field work experience in Mapudungun joined our team. Similarly, another member of the AVENUE project visited Cuzco, Peru, to learn Quechua for three months. Unlike Hebrew, Mapudungun, and Quechua, the AVENUE team did not elect to work on Hindi: the task of building an MT system for Hindi was presented to us as the surprise language portion of the DARPA XX project. Hence, no computational linguist on our team speaks Hindi. Fortunately, we had access to fairly significant computational resources for Hindi.

As noted, computational resources for low-resource languages are more difficult to come by than for resource-rich languages. Only Hindi had either an easily available, ready-made morphological analyzer or a significant amount of parallel sentence-aligned data. For Hebrew we were able to adapt a freely available morphological analyzer, while for Mapudungun and Quechua we built morphological analysis tools from scratch. To our knowledge there was no previously existing Mapudungun morphological analyzer, hence we were forced to build our own. And with the effort we had put into building a morphological analyzer for Mapudungun, together with our acquired human knowledge resources, we were able to quickly create a basic morphological analyzer for Quechua.

None of the four low-resource languages had lexicons adequate for machine translation. Sections X describe in detail the creation and adaptation of lexical resources for each individual language. For two of the languages, Hindi and Mapudungun, the AVENUE project had access to reasonably sized sentence-aligned parallel corpora. Through participation in the DARPA surprise language exercise on Hindi, the AVENUE project gained access to a large sentence-aligned Hindi-English parallel corpus. In contrast, a significant effort within the AVENUE project, as detailed in section X, was devoted to creating, transcribing, and translating a bilingual Mapudungun-Spanish speech corpus. And rounding out the computational resources, the elicitation corpus is a unique data resource in that it is created in the resource-rich language and then translated by bilingual speakers into the resource-poor language. Because of the emphasis within the AVENUE project on collaboration with native speakers, the elicitation corpus was available as a data resource in each of the four languages.

             Human Resources                                           Computational Resources
             CL with      CL with                  Bilingual   Morphological   Lexicon           Parallel   Elicitation
             MT           Language Pair            Speaker     Analysis                          Corpus     Corpus
Mapudungun   X            X                        X           Created         Created           Created    X
Quechua      X            Learned                  X           Created         Created                      X
Hebrew       X            X                        X           Adapted         Adapted                      X
Hindi        X            External consultant /    X           X               Created/Adapted   X          X
                          Learned

Table 2. Different languages with different available resources. This table presents and contrasts the relevant available resources for the four low-resource languages discussed in this paper.


Table 3 lists the MT systems the AVENUE project developed for each of the four low-resource languages. Within the AVENUE project, the available resources dictate which MT approaches are feasible for a given language pair. Corpus-based approaches to MT were infeasible for the Hebrew-to-English and the Quechua-to-Spanish language pairs, because we lacked sentence-aligned bilingual corpora between these languages. Similarly, because the AVENUE team lacked first-hand knowledge of Hindi, only a small hand-written rule-based MT system was developed for Hindi-to-English MT. In addition to resource availability constraints, time constraints have also affected which MT systems we built for each language. Of the two corpus-based approaches to MT, we have only applied example-based MT to the task of Mapudungun-to-Spanish translation. Similarly, we have not yet employed the rule-learning approach to build MT systems for Mapudungun-to-Spanish or for Quechua-to-Spanish. Applying the AVENUE-developed automatic syntactic rule learning method of MT to Mapudungun and Quechua is high on our list of next steps.

2. Four MT Systems

Over the past six years, the AVENUE project at the Language Technologies Institute at Carnegie Mellon University has worked to produce basic MT systems for a number of resource-poor languages. Chief among our accomplishments is the construction of MT systems for four languages for which few or no MT systems have previously been built: Mapudungun, Quechua, Hebrew, and Hindi. Mapudungun is the smallest of these four languages: less than 1 million people speak Mapudungun in southern Chile and Argentina. Quechua, a second indigenous language of South America, is more widely spoken: there are approximately 10 million speakers of Quechua in Peru, Bolivia, Ecuador, southern Colombia, and northern Argentina. Hebrew is spoken by approximately 7 million people in Israel, and Hindi is the largest of the four languages for which we built MT systems: there are several hundred million native and secondary speakers of Hindi in India and abroad. Despite the large number of Hindi speakers, Hindi is still resource-poor, with only rudimentary natural language processing data and tools freely obtainable.

Building MT systems for resource-poor languages is challenging. Section X contrasted the availability (and unavailability) of resources for each of our four resource-poor languages. In sections X-X we consider each language in turn, specifically illustrating how an omnivorous approach to MT allowed the AVENUE team to overcome the specific resource gaps of each language to build MT systems. Finally, in Section X, we synthesize the lessons and experience we gained in building these machine translation systems, and present overall suggestions for others looking to build MT for languages with scarce computational resources.

2.1 Two Indigenous South American Languages

The omnivorous approach to MT within the AVENUE project was originally developed to tackle challenges we faced in developing MT for indigenous South American languages under NSF grant X. Omnivorous MT required close collaboration with native speaking populations. The AVENUE project relies on native bilingual speakers to provide crucial information toward building MT systems. After considering several candidate languages, the AVENUE team elected to work with the languages Mapudungun and Quechua, as they promised the closest collaboration with native language organizations. The details of the collaborations with the native organizations are presented in sections X and X.

Even with a strong working relationship with native speakers of Mapudungun and Quechua, the AVENUE team faced serious hurdles to creating functional MT for these languages. At the time the AVENUE team started working on Mapudungun, even simple natural language tools such as morphological analyzers or spelling correctors did not exist. In fact, there were few electronic resources from which such natural language tools might be built: there were no standard Mapudungun text or speech corpora or lexicons, and no parsed treebanks. The text that did exist was in a variety of competing orthographic formats. More resources exist for Quechua, but they are still far from what is needed to build a complete MT system.

Mapudungun and Quechua share challenges for MT creation beyond simple resource scarcity, however. These additional challenges stem from both Mapudungun and Quechua being indigenous South American languages.

             Corpus-Based MT               Rule-Based MT
             Example-Based   Statistical   Hand-Written   Learned
Mapudungun   X                             X
Quechua                                    X
Hebrew                                     X              X
Hindi        X               X             X (small)      X

Table 3. Languages vs. MT systems. Depending on the available resources, the AVENUE project built a variety of MT systems for each language.


First, both Mapudungun and Quechua pose unique challenges from a linguistic theory perspective. Although Mapudungun and Quechua are unrelated languages, they both have complex agglutinative morphological structures. In addition, Mapudungun is polysynthetic, incorporating objects into the verb of a sentence. Agglutination and polysynthesis are both linguistic properties that the majority of languages for which natural language resources have been built do not possess. Second, human factors also pose a particular challenge for these two languages. Specifically, there is a scarcity of people trained in computational linguistics who are native speakers of, or who have a good knowledge of, these indigenous languages. And finally, an often overlooked challenge that confronts development of NLP tools for resource-scarce languages is the divergence of the culture of the native speaker population from the western culture of computational linguistics.

Despite the challenges facing the development of natural language processing systems for Mapudungun and Quechua, the AVENUE team has developed a suite of basic language resources for each of these languages, and has then leveraged these resources into more sophisticated natural language processing tools. The AVENUE project led a collaborative group of Mapudungun speakers in first collecting, and then transcribing and translating into Spanish, by far the largest spoken corpus of Mapudungun available. From this corpus we then built a spelling checker and a morphological analyzer for Mapudungun. For Quechua we have created parallel and aligned data, as well as a basic bilingual lexicon with morphological information. And with these resources we have built two prototype machine translation (MT) systems for Mapudungun and one prototype MT system for Quechua. The following sections detail the construction of these resources, focusing on overcoming the specific challenges Mapudungun and Quechua each present as resource-scarce languages.

2.2 MAPUDUNGUN

Since May of 2000, in an effort to ultimately produce Mapudungun-Spanish machine translation, computational linguists at CMU's Language Technologies Institute have collaborated with Mapudungun language experts at the Instituto de Estudios Indígenas (IEI - Institute for Indigenous Studies) at the Universidad de la Frontera (UFRO) in Chile, and with the Bilingual and Multicultural Education Program of the Ministry of Education (Mineduc) in Chile (Levin et al., 2000).

From the very outset of our collaboration, we battled with the scarcity of electronic resources for Mapudungun. The first phase of the AVENUE collaboration was to collect and produce parallel Mapudungun-Spanish language data, from which higher-level language processing tools and systems could be built. One barrier we faced in the collection of Mapudungun language data is that there are currently several competing orthographic conventions for written Mapudungun. Early in the AVENUE Mapudungun collaboration, the IEI-UFRO team established a set of orthographic conventions. All the data collected under the AVENUE project conforms to this set of orthographic conventions. If the Mapudungun data was originally in some other orthographic convention, then we manually converted it into our own orthography. Recently, however, a different orthography, Azümchefi, has been chosen by the Chilean government for official documents. Portions of our data have been automatically converted into Azümchefi using automatic substitution rules.

2.2.1 Corpora and Lexicons

Recognizing the appetite of MT systems for data resources, and the scarcity of Mapudungun language data, the AVENUE team began collecting Mapudungun data soon after our collaboration began with IEI and UFRO in 2000. Directed from CMU and conducted by native speakers of Mapudungun at the UFRO in Temuco, Chile, our data collection efforts ultimately resulted in three separate corpora: 1) a small parallel Mapudungun-Spanish corpus of historical texts and newspaper text; 2) 1,700 Spanish sentences that were manually translated and aligned into Mapudungun (these 1,700 Spanish sentences were carefully crafted to elicit specific syntactic phenomena in Mapudungun); and 3) a parallel corpus consisting of 170 hours of transcribed and translated Mapudungun speech. Natural language translation is challenging even for humans, particularly when the source text to translate is spoken language, with its many disfluencies and colloquial expressions. The translation of the speech corpus was completed by bilingual speakers of Mapudungun and Spanish who attempted to translate as literally as possible, but who were not professional translators. Furthermore, each section of the speech corpus was only ever viewed by a single translator. All of these translation challenges combine to render the Spanish translation of the Mapudungun text imperfect. Still, a bilingual corpus is a valuable resource. Additional details on the corpora are described in Monson et al. 2004.

While small, the second corpus listed above, the structural elicitation corpus, has aided our development of the hand-built rule-based MT system for Mapudungun (see section XX). The larger spoken corpus was put to two specific uses to develop MT for Mapudungun. First, the transcribed and translated spoken corpus was used as a parallel corpus to develop an Example-Based MT system, described further in section XX. Second, the AVENUE team used the spoken corpus to derive a bilingual stem lexicon for use in Rule-Based MT systems.

To obtain the bilingual stem lexicon, the AVENUE team first converted the transcribed spoken corpus text into a monolingual Mapudungun word list. The 117,003 most frequent fully-inflected Mapudungun word forms were hand-checked for spelling according to the adopted orthographic conventions for Mapudungun, and 15,120 of the most frequent spelling-corrected fully inflected word forms were hand-segmented into two parts: a first part consisting of the stem of the word, and a second part consisting of one or more suffixes. This produced 5,234 stems and 1,303 suffix groups. These 5,234 stems were then translated into Spanish.
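The first step of this pipeline, producing the frequency-ranked word list that was hand-checked, amounts to simple corpus counting; a minimal sketch, assuming the transcribed corpus is available as plain-text lines:

    from collections import Counter

    def frequency_ranked_words(corpus_lines):
        """Count fully inflected word forms in the transcribed corpus
        and return them most-frequent-first, for hand-checking."""
        counts = Counter()
        for line in corpus_lines:
            counts.update(line.split())
        return [word for word, _ in counts.most_common()]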

2211 Mapudungun morphological analyzer

In addition to consuming data resources, the MT systems used in the AVENUE project consume the computational resource of morphological analysis. Mapudungun is an agglutinative and polysynthetic language. Figure 2 contains glosses of a few morphologically complex Mapudungun verbs that occur in our corpus of spoken Mapudungun, and Monson et al. (2004) document the large number of occurring Mapudungun word forms in this speech corpus. A typical complex verb form in our spoken corpus consists of five or six morphemes.


Agglutinative morphological processes explode the number of possible surface word forms while simultaneously causing individual surface word forms to be rare. Furthermore, each morpheme may alone contain several morphosyntactic features. Hence, it is difficult for an MT system to translate Mapudungun words directly into another language. Analyzing the morphology of Mapudungun allows our MT systems to cover a larger fraction of all Mapudungun surface word forms. And by identifying each morpheme in each Mapudungun word and assigning a meaning to each identified morpheme, a rule-based MT system can translate each piece of meaning individually.

We had found few data resources available for Mapudungun, so we were not surprised to learn that no morphological analyzer for Mapudungun had ever been written; we therefore developed our own. Using the stem lexicon we had created and reference grammars of Mapudungun, our morphological analyzer takes a Mapudungun word as input and produces as output all possible morphological analyses of the word. Each analysis specifies:

1) a single stem in that word,
2) each suffix in that word, and
3) a syntactic analysis for the stem and each identified suffix.

To identify the morphemes (stem and suffixes) in a Mapudungun word, the lexicon of stems works together with a fairly complete lexicon of Mapudungun suffixes. The current version of the stem lexicon contains XXX stems and their Spanish translations. Each entry in this lexicon lists the part of speech of the stem as well as other features associated with the stem, such as lexical aspect in the case of verb stems. The suffix lexicon, built by hand by computational linguists on the AVENUE team, contains all the uncontroversial inflectional morphemes of Mapudungun. There are XXX Mapudungun suffixes in the suffix lexicon. Each suffix lists the part of speech that the suffix attaches to, as well as the linguistic features, such as person, number, mood, etc., that it marks. The morphological analyzer performs a recursive and exhaustive search over all possible segmentations of a given Mapudungun word. The analyzer starts from the beginning of the word and identifies and removes each stem that is an initial string in that word. (The current morphological analyzer does not handle noun incorporation, a fairly rare phenomenon in Mapudungun.) The analyzer then examines the remaining string, looking for a valid combination of suffixes that could complete the word. A suffix combination is valid if 1) each of the suffixes in the combination can attach to stems with the part of speech of the current stem, and 2) the suffixes occur in a grammatical order.
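To make the segmentation search concrete, the following is a minimal sketch of the recursive, exhaustive search in Python. The toy STEMS and SUFFIXES tables and the valid_combination stub are illustrative stand-ins for the real stem and suffix lexicons and for the part-of-speech and suffix-order checks; they are not the analyzer's actual code.

STEMS = {"pe": {"pos": "V"}, "kellu": {"pos": "V"}}        # stem -> features (toy)
SUFFIXES = {"fi": {"obj": 3}, "n": {"person": 1}}          # suffix -> features (toy)

def suffix_sequences(rest):
    # Recursively enumerate every way to split `rest` into known suffixes.
    if rest == "":
        yield []
    for i in range(1, len(rest) + 1):
        if rest[:i] in SUFFIXES:
            for tail in suffix_sequences(rest[i:]):
                yield [rest[:i]] + tail

def valid_combination(stem_entry, suffixes):
    # Placeholder: check that each suffix can attach to this stem's part
    # of speech and that the suffixes occur in a grammatical order.
    return True

def analyze(word):
    # Return all (stem, suffixes) segmentations of `word`: peel off every
    # stem that is an initial string, then segment the remainder.
    analyses = []
    for i in range(1, len(word) + 1):
        stem = word[:i]
        if stem in STEMS:
            for suffixes in suffix_sequences(word[i:]):
                if valid_combination(STEMS[stem], suffixes):
                    analyses.append((stem, suffixes))
    return analyses

print(analyze("pefin"))    # [('pe', ['fi', 'n'])]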

Mapudungun morphology is relatively unambiguous. Most Mapudungun words for which the stem is in the stem lexicon receive a single analysis; a few truly ambiguous suffix combinations may cause a Mapudungun word to receive more than one distinct analysis.

Once the morphological analyzer has found all possible segmentations of a word, it combines the feature information from the stem and the suffixes encountered in the analyzed word to create a syntactic analysis that is returned. For an example, see Figure 3.

222 Two Machine Translation Systems

With the basic data and computational resources built for Mapudungun, the AVENUE team could proceed toward constructing machine translation systems from Mapudungun into Spanish. We have developed two prototype Mapudungun-Spanish MT systems. The first follows a data-driven paradigm, Example-Based MT: an EBMT system can exploit the decent-sized spoken corpus of Mapudungun as a source of data to produce translations with higher lexical coverage but lower translation accuracy. The second MT system is a hand-built Rule-Based MT system. Building an MT system by hand is feasible only because the AVENUE team has computational linguists available as a resource. The human-developed Rule-Based MT system complements the EBMT system with high quality translations of specific syntactic constructions over a smaller vocabulary.

2221 An Example-Based Machine Translation System

Example-Based Machine Translation (EBMT) relies on previous translations performed by humans to create new translations without the need for human translators. The previous translations are called the training corpus. For the best translation quality, the training corpus should be as large as possible and as similar as possible to the text to be translated. When the exact sentence to be translated occurs in the training material, the translation quality is human-level, because the previous translation is simply re-used. As the source sentence differs more and more from the training material, quality decreases, because smaller and smaller fragments must be combined to produce the translation, increasing the chances of an incorrect translation.

As the amount of training material decreases, so does the translation quality, because there are fewer long matches between the training texts and the input to be translated. Conversely, more training data can be added at any time, improving the system's performance by allowing more and longer matches. EBMT usually finds only partial matches, which generate lower-quality translations. Further, EBMT searches for phrases of two or more words, and thus there can be portions of the input which do not produce phrasal translations. For unmatched portions of the input, EBMT falls back on a probabilistic lexicon trained from the corpus to produce word-for-word translations. This fall-back approach provides some translation (even though of lower quality) for any word form encountered in the training data. While the translation quality of an EBMT system can be human-level when long phrases or entire sentences are matched, any mistakes in the human translations used for training (spelling errors, omissions, mistranslations) will become visible in the EBMT system's output. Thus it is important that the training data be as accurate as possible. The training corpus we use for EBMT is the spoken language corpus described earlier; as discussed above, the corpus of spoken Mapudungun contains some errors and awkward translations.


Highly agglutinative languages like Mapudungun pose a challenge for Example-Based MT. Because there are so many inflected versions of each stem, most inflected words are rare. If the rare words do not occur in the corpus at all, they will not be translatable by EBMT; if they occur only a few times, it will still be hard for EBMT to gather accurate statistics about how they are used. Additionally, automatic word-level alignment between the two languages, which is used to determine the appropriate translation of a partial match, becomes more difficult when individual words in one language correspond to many words in the other language. We address both issues by using the morphological analyzer for Mapudungun to split words into individual stems and suffixes. Alone, each stem and suffix is more common than the original combination of stem and suffixes, and the separate morphemes are more likely to map to single words in Spanish.

We currently have a prototype EBMT system trained on approximately 204,000 sentence pairs from the corpus of spoken Mapudungun, containing a total of 1.25 million Mapudungun tokens after splitting the original words into stems and one or more suffixes. This separation increased BLEU scores (Papineni et al., 2001) on a held-out portion of the speech corpus by 5.48%, from 0.1530 to 0.1614. We expect further improvements in MT accuracy from more detailed use of morphological analysis, from the inclusion of common phrases in the corpus, and from repairing translation errors and awkward translations in the corpus.
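The following sketch illustrates the splitting step (assuming an analyze function like the one sketched in section 2211 that returns (stem, suffixes) pairs; taking the first analysis stands in for real disambiguation):

def split_for_ebmt(sentence):
    # Replace each analyzable word by its stem followed by its suffixes,
    # so that EBMT and word alignment operate on morphemes, not words.
    out = []
    for word in sentence.split():
        analyses = analyze(word)
        if analyses:
            stem, suffixes = analyses[0]
            out.append(stem)
            out.extend("-" + s for s in suffixes)   # mark bound morphemes
        else:
            out.append(word)                        # unknown word: leave intact
    return " ".join(out)

print(split_for_ebmt("pefin"))    # "pe -fi -n"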

2222 A Rule-Based MT system

The second MT system the AVENUE team has developed is a Rule-Based system where the rules have been built by hand. Manually developing a Rule-Based MT (RBMT) system involves detailed comparative analysis of the grammars of the source and target languages. Manually imbuing an MT system with detailed syntactic knowledge can produce high quality translation, but typically takes a significant amount of time to implement. Despite the time requirements, developing a Rule-Based MT system for Mapudungun by hand is a natural fit in our case, for two reasons. First, the AVENUE project has the necessary human expertise available to hand-develop an MT system with significant syntactic understanding. And second, hand-built rule-based MT nicely complements the weaknesses of EBMT: EBMT focuses on high translation coverage at the expense of accuracy; in contrast, the rule-based MT system does not exhaustively cover Mapudungun grammar, but the rules we wrote cover a portion of Mapudungun with high accuracy.

At a high level, the AVENUE Rule-Based MT framework is composed of three modules: a source language morphological analyzer, the syntactic rule-based transfer system itself, and a target language morphological generator. The source language morphological analyzer processes the raw input source sentence into a lattice of all possible morphological segmentations. The transfer system maps the morphologically analyzed source sentences into target language expressions via a transfer lexicon and grammar. The transfer grammar and lexicon contain syntactic and lexical rules that explicitly refer to morphosyntactic features of the two languages (more details are given in the next subsections). The output of the transfer system is a target language expression composed of uninflected words bundled with grammatical features. It is then the morphological generator's job to massage the uninflected word-feature bundles into fully inflected target language output sentences.

In the Mapudungun-Spanish Rule-Based MT system, the source language morphological analyzer is the Mapudungun morphological analyzer described earlier in section 2211. The Mapudungun-to-Spanish transfer grammar and lexicon, and the Spanish morphological generator, are described in the sections that follow. But before delving into the details of the Mapudungun-Spanish rule-based MT system, a short overview of the AVENUE rule-based MT framework is needed.

2223 Overview of Transfer Rule Syntax and the Run-time Transfer System

This paper does not describe in detail the rationale behind either the transfer rules used in the AVENUE Rule-Based MT system or the run-time MT transfer engine used to interpret them.

Figure: A simplified hand-written AVENUE-style transfer rule. This rule transfers Mapudungun nominal expressions that mark for number into Spanish nominal expressions. Where Spanish marks number morphologically on the head noun, Mapudungun uses a separate particle to mark number on nouns.

NBar1                              1) Rule identifier
NBar::NBar : [PART N] -> [N]       2) SL and TL constituent structure specification
((X2::Y1)                          3) SL (X2) to TL (Y1) constituent alignment
((X1 number) =c pl)                4) SL constraint: this rule applies only if SL number is pl
((X0 number) = (X1 number))        4) SL constraint: SL number is passed up to the SL NBar
((Y0 number) = (Y1 number))        5) TL constraint: TL number is passed up to the TL NBar
((Y0 gender) = (Y1 gender))        5) TL constraint: TL gender is passed up to the TL NBar
((Y0 number) = (X0 number)))       6) Transfer equation: the number on the TL NBar must equal the number on the SL NBar


This section, however, does provide a basic understanding of syntactic rules and their interpretation within Rule-Based MT in the AVENUE framework. For more specifics on the AVENUE Rule-Based MT formalism and run-time transfer engine, please see Peterson (2002) and Probst et al. (2002).

22231 The AVENUE Transfer Rule Formalism

In Rule-Based MT, transfer rules decompose the grammatical information contained in a source language expression into a set of grammatical properties, such as number, person, tense, and lexical meaning. Each particular rule then builds an equivalent target language expression, copying, modifying, or rearranging grammatical values according to the requirements of the target language syntax and target language lexical repertoire. In the AVENUE system, translation rules have six components (see the figure above): 1) a rule identifier, which consists of a constituent type (Sentence, Noun Phrase, Verb Phrase, etc.) and a unique ID number; 2) a constituent structure specification for both the source language (SL) and the target language (TL); 3) alignments between the SL constituents and the TL constituents; 4) SL constraints, which provide information about features and their values in the SL sentence; 5) TL constraints, which provide information about features and their values in the TL sentence; and 6) transfer equations, which provide information about which feature values transfer from the source into the target language.

The figure above illustrates these six components using a portion of a rule intended for a Mapudungun-to-Spanish rule-based transfer system. In Mapudungun, plurality in nouns is marked in some cases by the prenominal particle pu. The transfer rule in the figure transfers the plural feature from pu in Mapudungun onto Spanish nouns. In this transfer grammar, NBar is the constituent that dominates the noun and its modifiers, but not its determiners. According to this rule, the Mapudungun sequence PART N will be transferred into a noun in Spanish. The constituent alignment aligns the Mapudungun noun (X2) with the Spanish noun (Y1). The SL is checked to ensure the application of the NBar 1 rule only when the Mapudungun particle specifies plural number; if the Mapudungun noun is preceded by any other particle, the rule fails to apply. The number feature is passed up from the particle to the Mapudungun NBar. At the same time, the number that is marked on the Spanish noun is passed up to the Spanish NBar. The gender feature, present only in Spanish, is passed up from the Spanish noun to the Spanish NBar. And finally, a transfer equation ensures that the number on the Spanish NBar equals the number on the Mapudungun NBar. This process can also be represented graphically as a tree structure.
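Read procedurally, the rule's constraints amount to a few checks and copies over feature structures. The sketch below renders them with plain Python dicts (x1 is the Mapudungun particle, y1 the Spanish noun); the function name and dict representation are illustrative, not the transfer engine's actual API.

def apply_nbar_rule(x1, y1):
    # ((X1 number) =c pl): value constraint; the rule fails to apply otherwise.
    if x1.get("number") != "pl":
        return None
    x0 = {"number": x1["number"]}          # ((X0 number) = (X1 number))
    y0 = {"number": y1.get("number"),      # ((Y0 number) = (Y1 number))
          "gender": y1.get("gender")}      # ((Y0 gender) = (Y1 gender))
    y0["number"] = x0["number"]            # ((Y0 number) = (X0 number)): transfer equation
    return y0

print(apply_nbar_rule({"number": "pl"}, {"gender": "f"}))
# {'number': 'pl', 'gender': 'f'}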

We make use of a new framework (Probst et al., 2002) for rapid prototyping of MT systems for languages with limited amounts of electronically available linguistic resources and corpora, which includes a declarative formalism for symbolic transfer grammars. A grammar consists of a collection of transfer rules, which specify how phrase structures in a source language correspond and transfer to phrase structures in a target language, and the constraints under which these rules should apply. The framework also includes a fully-implemented transfer engine that applies the transfer grammar to a source-language input sentence at runtime and produces all possible word- and phrase-level translations according to the grammar. This framework was designed to support research on methods for automatically acquiring transfer grammars from limited amounts of elicited word-aligned data. The framework also supports manual development of transfer grammars by experts familiar with the two languages. While the framework itself is still research work in progress, it is sufficiently well developed for serious experimentation.


22232 The Runtime Transfer Engine

At run time, the transfer engine interprets individual transfer rules like the one shown above to convert a morphologically analyzed source language sentence into a target language expression. The output of the run-time system is a lattice of translation alternatives that have yet to be morphologically inflected in the target language. The transfer engine produces a lattice of alternatives, as opposed to a single target language expression, because there may be syntactic and lexical ambiguity in the transfer rules and lexicon: multiple translation choices for source language lexical items in the dictionary and multiple competing transfer rules can each increase the number of possible translations (see the next section).

The run-time translation system incorporates into a single module all three of the traditional steps of syntactic transfer-based MT: 1) syntactic parsing of the source language input; 2) transfer of the parsed constituents of the source language to their corresponding structured constituents on the target language side; and 3) generation of the target language expressions from the target language constituents. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system.

In the first stage, source language (SL) parsing is performed based solely on the SL side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up chart parser, such as described in Allen (1995). A chart is populated with all the constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage: a target language (TL) chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules provide individual lexical choices for the TL word-level entries in the TL chart. The set of generated TL output expressions that corresponds to the collection of all TL chart entries is then collected into a TL lattice. Finally, the TL lattice is passed to a statistical decoder, which ranks the possible paths through the lattice of translation alternatives. A more detailed description of the runtime transfer-based translation sub-system can be found in Peterson (2002).
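Schematically, the integrated transfer stage can be pictured as below: every SL chart entry is passed through the transfer rules for its constituent type, and all resulting TL constituents are pooled into a lattice keyed by input span. This is an illustration of the control flow only, not the engine's real data structures.

def transfer_chart(sl_chart, rules):
    # Map every SL chart entry through all applicable transfer rules,
    # pooling the TL alternatives into a lattice keyed by input span.
    tl_lattice = []
    for entry in sl_chart:
        for rule in rules.get(entry["type"], []):
            tl = rule(entry)              # returns a TL constituent, or None
            if tl is not None:
                tl_lattice.append((entry["span"], tl))
    return tl_lattice

# Toy usage: a single lexical transfer rule for a V entry.
rules = {"V": [lambda e: {"type": "V", "lex": {"pe": "ver"}.get(e["lex"])}]}
chart = [{"type": "V", "span": (0, 1), "lex": "pe"}]
print(transfer_chart(chart, rules))       # [((0, 1), {'type': 'V', 'lex': 'ver'})]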

The methodology behind the system combines two separate modules: a transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to a target language model. This general framework has been under development by the MT group at Carnegie Mellon under the AVENUE project (Probst et al., 2002) and was previously used for rapid prototyping of an MT system for Hindi-to-English translation (Lavie et al., 2003).


22233 Decoding

In the final stage, a decoder is used to select a single target language translation from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of the input, and these translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of the target language string given the source language string. The decoder is designed to use a probability model that calculates this probability as a product of two factors: a translation model for the translation units and a language model for the target language. The translation model is normally based on a probabilistic version of the translation lexicon, in which probabilities are associated with the different possible translations for each source-language word. Such a probability model can be acquired automatically using a reasonable-size parallel corpus to train a word-to-word probability model based on automatic word alignment algorithms. This was not possible for the Hebrew-to-English system described in section 24, since we did not have a suitable parallel corpus at our disposal; the decoder for the Hebrew-to-English system therefore uses the English language model as the sole source of information for selecting among alternative translation segments.


We use an English trigram language model trained on 160 million words.

The decoding search algorithm considers all possible sequences in the lattice and calculates the language model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
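The following is a much-simplified sketch of such a lattice search: a dynamic program over input positions that scores each sequence of adjoining, non-overlapping units with a language model (a bigram scorer here for brevity, where the system described uses a trigram model, and without the limited reordering mentioned above).

import math

def decode(lattice, lm, n):
    # lattice: translation units (start, end, words) over an n-word input;
    # lm(prev, word) -> P(word | prev). best maps an input position to the
    # highest-scoring (logprob, output words, last word) reaching it.
    best = {0: (0.0, [], "<s>")}
    for pos in range(n):
        if pos not in best:
            continue
        score, words, prev = best[pos]
        for (s, e, unit) in lattice:
            if s != pos:
                continue
            sc, pv = score, prev
            for w in unit:
                sc += math.log(lm(pv, w))
                pv = w
            if e not in best or sc > best[e][0]:
                best[e] = (sc, words + list(unit), pv)
    return best.get(n)

# Toy usage with a uniform "language model".
lattice = [(0, 1, ["the"]), (1, 2, ["house"]), (0, 2, ["the", "home"])]
print(decode(lattice, lambda prev, w: 0.5, 2))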

2224 Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the Mapudungun-Spanish manually written rule-based MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, we also did not neglect the available computational resources for Mapudungun. We developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus. Similarly, a version of the elicitation corpus translated into Mapudungun served as a development suite to guide us on which syntactic transfer rules were most vital to write.

The next three sub-sections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages; manually writing transfer rules forced us to consciously overcome them. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. And the third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor of the kind of transfer rules that are needed to write a transfer grammar from Mapudungun to Spanish, two completely unrelated languages. For a more in-depth discussion of the transfer grammar and of translation mismatches between Mapudungun and Spanish, please see Aranovich (forthcoming).

22241 Suffix Agglutination

As discussed earlier, Mapudungun is an agglutinative language, often with four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled VSuffG 1 in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second VSuffG rule, labeled VSuffG 2 in Figure 6, combines an additional VSuff recursively with a VSuffG, passing up the feature structures of both children to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language, Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to and generation of Spanish expressions.

These two VSuffG rules apply successively to combine the two verbal suffixes -fi and -ñ in example (1) into a single VSuffG. First, the rule VSuffG 1 is applied to -fi to produce a VSuffG marked with third person object agreement. Next, the rule VSuffG 2 recursively combines -ñ with the VSuffG containing -fi into a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verb Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1) pe -fi -ñ
    see -3.obj -1.sg.subj.indic
    'I saw him/her/them/it'

VSuffG1
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

VSuffG2
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
(X0 = X2))

Figure 6: Two recursive Verbal Suffix Group rules



22242 Tense

Tense in Mapudungun is often morphologically unmarked. The temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun; consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (transfer of features from Mapudungun to Spanish is not represented in the rule for brevity). Analogous rules deal with the other temporal specifications.
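Read as a decision table, Figure 7 amounts to a few ordered cases; the sketch below is a direct illustration of that table, not the grammar's actual machinery (which states the same conditions declaratively in VBar rules such as the one in Figure 8).

def temporal_interpretation(grammatical, lexical_aspect):
    # Grammatical features from the VSuffG plus the verb's lexical aspect
    # determine the tense to be generated on the Spanish side.
    if grammatical == "future":
        return "future"               # pe-a-n  'I will see'
    if grammatical == "habitual":
        return "present"              # kellu-ke-n  'I help'
    if lexical_aspect == "stative":
        return "present"              # niye-n  'I own'
    return "past"                     # kellu-n  'I helped'

print(temporal_interpretation("unmarked", "stative"))    # present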

22243 Typological divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are expressed syntactically in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation, but the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish (see example (2)). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as well as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle (see example (3)).

(2) Mapudungun: kofke -tu -n
                bread -verbalizer -1.sg.subj.indic
    Spanish:    com-o pan
                eat -1.sg.pres.indic bread
    English:    'I eat bread'

(3) Mapudungun: pe -nge -n
                see -passive -1.sg.subj.indic
    Spanish:    fui visto
                aux.1.sg.past.indic see.pass.part
    English:    'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side. The rule for passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary the correct grammatical features, and inflect the verb as a passive participle.

Figure 7: Abstract rules for the interpretation of tense in Mapudungun

Grammatical Features    Lexical    Temporal
Tense      Aspect       Aspect     Interpretation   Example
Future     Any          Any        Future           pe-a-n (see-future-1.sg.subj.indic) 'I will see'
Unmarked   Any          Stative    Present          niye-n (own-1.sg.subj.indic) 'I own'
Unmarked   Habitual     Unmarked   Present          kellu-ke-n (help-habitual-1.sg.subj.indic) 'I help'
Unmarked   Unmarked     Unmarked   Past             kellu-n (help-1.sg.subj.indic) 'I helped'

VBar1
VBar::VBar : [V VSuffG] -> [V]
((X1::Y1)                             alignment
((X2 tense) = UNDEFINED)              SL constraint on morphological tense
((X1 lexicalaspect) = UNDEFINED)      SL constraint on the verb's aspectual class
((X2 aspect) = (NOT habitual))        SL constraint on grammatical aspect
((X0 tense) = past)                   SL tense feature assignment
…)

Figure 8: Past tense rule (transfer of features omitted)


Figure 9 shows a simplified version of the rule that produces this result.

2225 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is neither of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona under a research license. Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that the inflected form marks. We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute-value pairs of the format that our MT system expects. This way, when the run-time transfer engine needs a particular inflected form of a Spanish word, it can pass the uninflected form to the Spanish morphology generator and have it generate the appropriate surface inflected forms.
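Conceptually, generation is then a lookup keyed by citation form plus feature bundle. The sketch below uses a tiny hand-rolled fragment in place of the UPC inflected dictionary; the data and function names are illustrative only.

FORMS = {
    ("cantar", (("mood", "ind"), ("number", "sg"), ("person", 3), ("tense", "past"))): "cantó",
    ("casa", (("number", "pl"),)): "casas",
}

def generate(lemma, features):
    # Return the inflected surface form for a lemma + feature bundle,
    # falling back to the citation form when no entry is found.
    return FORMS.get((lemma, tuple(sorted(features.items()))), lemma)

print(generate("cantar", {"tense": "past", "person": 3, "number": "sg", "mood": "ind"}))  # cantó
print(generate("azul", {}))    # falls back to the citation form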

When the Spanish morphological generation module is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN) ) ) ) ) )>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA) ) ) ) ) ) )>
tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO) ) ) ) ) ) )>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA) ) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN) ) ) ) ) )>

223 Evaluation

Things that need to go in here:
- hand evaluation
- BLEU, NIST, and METEOR scores
- can we evaluate the EBMT system as well on the same test data?
- statistical significance tests of the results
- automatic evaluation of a word-to-word translator

23 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals of building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

VBar6
VBar::VBar : [V VSuffG] -> [V V]      insertion of aux on the Spanish side
((X1::Y2)                             Mapudungun verb (X1) aligned to Spanish verb (Y2)
((X2 voice) =c passive)               Mapudungun voice constraint
((Y1 person) = (Y0 person))           passing person features to aux in Spanish
((Y1 number) = (Y0 number))           passing number features to aux in Spanish
((Y1 mood) = (Y0 mood))               passing mood features to aux in Spanish
((Y2 number) =c (Y1 number))          Spanish-side agreement constraint
((Y1 tense) = past)                   assigning tense feature to aux in Spanish
((Y1 form) =c ser)                    Spanish auxiliary selection
((Y2 mood) = part)                    Spanish-side verb form constraint
…)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, the non-Indo-European language Mapudungun marks passive voice with a verbal suffix; hence an auxiliary verb must be inserted on the Spanish side. The auxiliary verb is the first V constituent on the production side of the transfer rule. This rule then ensures that the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule for brevity.



For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker, and were later manually revised. Taking the Elicitation Corpus (section N) as a reference, we developed a small translation grammar manually.

231 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough for development of MT. The EC resembles a fieldwork questionnaire, containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features such as number, person, tense, and gender; the version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books which had parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors with the original image of the text as a reference.

232 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module (section 2225) is called to inflect the corresponding Spanish stems with the relevant features.

2321 Morphology and Translation Lexicons

In order to build a translation and morphology lexicon, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books; only 3,002 word types were in more than one book.[1] Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

[1] This was done before the OCR correction was completed, and thus this list contained OCR errors.


These 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10.

1) Word segmentation
2) Root translation
3) Root part-of-speech
4) Word translation
5) Word part-of-speech
6) Translation of the final root

Figure 10: Fields required to automatically build the translation and morphology lexicons

The native speaker was asked to provide the translation of the final root (the last field in Figure 10) only if there had been a part-of-speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old" and the word is a verb whose root really means "to get old" ("machuyay").[2] Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently in Spanish; for these cases, the native speaker was asked to use || instead of |, and the post-processing scripts were designed to check for the consistency of || in both the translation and the POS fields.

[2] -ya- is a verbalizer in Quechua.
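The expansion itself is a cross-product over the alternatives in the annotated fields. The sketch below illustrates the idea on the chayqa example of Figure 11; it handles only the single-bar (|) alternatives, not the double-bar (||) cases, and is not the project's actual post-processing script.

def expand(root, translations, poses):
    # One lexical entry per POS/translation pairing, e.g.
    # expand('chay', 'ese | esa | eso', 'Pron | Adj') -> six entries.
    entries = []
    for pos in (p.strip() for p in poses.split("|")):
        for tr in (t.strip() for t in translations.split("|")):
            entries.append(f"{pos} | [{root}] -> [{tr}]")
    return entries

for e in expand("chay", "ese | esa | eso", "Pron | Adj"):
    print(e)    # Pron | [chay] -> [ese] ... Adj | [chay] -> [eso]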


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted (see Figure 13). For the current working MT prototype, the suffix lexicon has 36 entries; Cusihuamán's grammar (2001) lists a total of 150 suffixes.

Word     Segmentation  Root translation  Root POS    Word translation  Word POS
chayqa   chay+qa       ese | esa | eso   Pron | Adj  ese || es ese     Pron || VP

Figure 11: Example of a word segmented and translated

V | [ni] -> [decir]
((X1::Y1))

N | [pacha] -> [tiempo]
((X1::Y1))

N | [pacha] -> [tierra]
((X1::Y1))

Pron | [noqa] -> [yo]
((X1::Y1))

Interj | [alli] -> [a pesar]
((X1::Y1))

Adj | [hatun] -> [grande]
((X1::Y1))

Adj | [hatun] -> [alto]
((X1::Y1))

Adv | [kunan] -> [ahora]
((X1::Y1))

Adv | [allin] -> [bien]
((X1::Y1))

Adv | [ama] -> [no]
((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list

("dicen que" on the Spanish side)
Suff::Suff | [s] -> []
((X1::Y1)
((x0 type) = reportative))

(when following a consonant)
Suff::Suff | [si] -> []
((X1::Y1)
((x0 type) = reportative))

Suff::Suff | [qa] -> []
((X1::Y1)
((x0 type) = emph))

Suff::Suff | [chu] -> []
((X1::Y1)
((x0 type) = interr))

VSuff::VSuff | [nki] -> []
((X1::Y1)
((x0 person) = 2)
((x0 number) = sg)
((x0 mood) = ind)
((x0 tense) = pres)
((x0 inflected) = +))

NSuff::NSuff | [kuna] -> []
((X1::Y1)
((x0 number) = pl))

NSuff::Prep | [manta] -> [de]
((X1::Y1)
((x0 form) = manta))

Figure 13: Manually written suffix lexical entries


2322 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in subsection 22231 above, contains 25 rules and covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the morphological generator discussed in section 2225.

233 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and, where necessary, corrected nine machine translations. The correction of MT output is the first step towards automatic rule refinement and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.



S2
S::S : [NP VP] -> [NP VP]
((X1::Y1)
(X2::Y2)
((x0 type) = (x2 type))
((y1 number) = (x1 number))
((y1 person) = (x1 person))
((y1 case) = nom)
subj-v agreement:
((y2 number) = (y1 number))
((y2 person) = (y1 person))
subj-embedded Adj agreement:
((y2 PredAdj number) = (y1 number))
((y2 PredAdj gender) = (y1 gender)))

SBar1
SBar::SBar : [S] -> [Dice que S]
((X1::Y2)
((x1 type) =c reportative))

VBar4
VBar::VBar : [V VSuff VSuff] -> [V]
((X1::Y1)
((x0 person) = (x3 person))
((x0 number) = (x3 number))
((x2 mood) = (NOT ger))
((x3 inflected) =c +)
((x0 inflected) = +)
((x0 tense) = (x2 tense))
((y1 tense) = (x2 tense))
((y1 person) = (x3 person))
((y1 number) = (x3 number))
((y1 mood) = (x3 mood)))

Figure 14: Manually written grammar rules for Quechua-Spanish translation

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO) ) ) ) )>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ) ) ) ) ) ) )>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO) ) ) ) (VP3 (VBAR2 (V35 SOY) ) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA) ) ) ) ) ) )>


Figure 15: MT output from the Quechua-Spanish Rule-Based MT prototype

24 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior; in particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of (typically) three or more consonants and patterns are sequences of vowels, and sometimes also consonants, with 'slots' into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual; hence the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script[3], not unlike that for Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ('the'), prepositions such as B ('in'), K ('as'), L ('to'), and M ('from'), subordinating conjunctions such as $ ('that') and K$ ('when'), relativizers such as $ ('that'), and the coordinating conjunction W ('and'). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus a form such as MHGR can be read as a lexeme 'immigrant', as M-HGR 'from Hagar', or even as M-H-GR 'from the foreigner'. Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, of the root, or of a prefixing particle.

[3] To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.


An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ('dots'), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the 'undotted' script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule based, relying on the AVENUE run-time transfer engine described in section 22232 to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period with a total labor effort equivalent to about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. The small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing publicly available resources, which we adapted in novel ways for the MT task to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures, and translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation segments that together represent the most likely translation of the entire input sentence.


We now describe each of the components in more detail.

25 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see the next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules; the same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
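At its core the mapping is a character-by-character substitution. The sketch below shows the idea with a five-letter fragment of the table (given here in UTF-8 for readability, whereas the system reads Windows-encoded input); the full table and the surrounding punctuation handling are omitted.

ROMAN = {"ב": "B", "ש": "$", "ו": "W", "ר": "R", "ה": "H"}    # fragment only

def romanize(text):
    # Map each Hebrew letter to its romanized equivalent, passing
    # everything else (digits, punctuation) through unchanged.
    return "".join(ROMAN.get(ch, ch) for ch in text)

print(romanize("בשורה"))    # B$WRH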

26 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the Hebrew analyzer is only available to us as a 'black-box' component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure which encodes the relevant morpho-syntactic features that were returned by the analyzer.

lexemes is associated with a feature structure which en-codes the relevant morpho-syntactic features that were re-turned by the analyzer

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which will all be stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice representation of a set of analyses for the Hebrew word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-structure representation of a set of analyses for the Hebrew word B$WRH
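The lattice and its feature structures map naturally onto a small edge-based data structure. The following Python sketch (illustrative only, not the actual AVENUE representation) encodes the analyses of Figures 2 and 3 as edges over span positions and enumerates the alternative lexeme sequences; note that, as in Figure 3, the possessive clitic that surfaces as H carries the lexeme $LH.

from dataclasses import dataclass, field

@dataclass
class Edge:
    span_start: int
    span_end: int
    lex: str
    feats: dict = field(default_factory=dict)

# The analyses of B$WRH from Figure 3, one edge per lexeme.
edges = [
    Edge(0, 4, "B$WRH", {"POS": "N", "GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"}),
    Edge(0, 2, "B", {"POS": "PREP"}),
    Edge(1, 3, "$WR", {"POS": "N", "GEN": "M", "NUM": "S", "STATUS": "ABSOLUTE"}),
    Edge(3, 4, "$LH", {"POS": "POSS"}),
    Edge(0, 1, "B", {"POS": "PREP"}),
    Edge(1, 2, "H", {"POS": "DET"}),
    Edge(2, 4, "$WRH", {"POS": "N", "GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"}),
    Edge(0, 4, "B$WRH", {"POS": "LEX"}),  # unknown-word fallback edge
]

def paths(edges, start, end):
    """Enumerate all edge sequences that exactly cover [start, end)."""
    if start == end:
        yield []
        return
    for e in edges:
        if e.span_start == start and e.span_end <= end:
            for rest in paths(edges, e.span_end, end):
                yield [e] + rest

# Prints the lexeme sequences of Figure 2 (B$WRH appears twice because
# the unknown-word edge duplicates the full-word analysis).
for p in paths(edges, 0, 4):
    print(" ".join(e.lex for e in p))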

While the analyzer of Segal (Segal, 1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice, rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in a machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) which require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
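As a rough sketch, the enhancement process for nouns could look as follows; the entry format and the english_plural spelling rule are hypothetical simplifications (the real process must also cover the verb forms listed above and irregular inflections):

def english_plural(noun: str) -> str:
    # Toy spelling rule; a real implementation needs irregulars too.
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def enhance_noun_entry(entry):
    """Given a Hebrew->English noun entry, add a plural variant whose
    constraint restricts it to plural analyses of the Hebrew input."""
    plural = dict(entry)
    plural["english"] = english_plural(entry["english"])
    plural["constraints"] = entry.get("constraints", []) + [("num", "pl")]
    return [entry, plural]

lexicon = [{"hebrew": "$MLH", "english": "dress", "pos": "N"}]
enhanced = [e for entry in lexicon for e in enhance_noun_entry(entry)]
print(enhanced)  # base entry plus a "dresses" entry constrained to (num pl)

At run-time, a plural Hebrew analysis unifies only with the plural entry, so the correct inflected English form falls out of ordinary constraint unification rather than a separate generation step.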

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
SL: $MLH ADWMH
TL: A RED DRESS
Score: 2
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1))

{NP1,3}
SL: H $MLWT H ADWMWT
TL: THE RED DRESSES
Score: 4
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1))

Figure 4: NP transfer rules for nouns modified by adjectives, from Hebrew to English

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003), and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team, and who is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages, and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Figure 5: Select translated sentences from the development data


System            BLEU                      NIST                      Precision  Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830     0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938     0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085     0.4241

Table 1: System performance results with the various grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period, and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual grammar system outperforms the no-grammar system according to all the metrics. The learned grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
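For concreteness, the bootstrapping could be sketched as follows, assuming a score function (e.g., BLEU or NIST computed over a resampled test set); the function name and defaults are illustrative:

import random

def bootstrap_ci(test_set, score, iters=10000, alpha=0.05):
    """Resample the test set with replacement and report the empirical
    (alpha/2, 1 - alpha/2) interval of the score distribution."""
    n = len(test_set)
    scores = []
    for _ in range(iters):
        sample = [random.choice(test_set) for _ in range(n)]
        scores.append(score(sample))
    scores.sort()
    lo = scores[int((alpha / 2) * iters)]
    hi = scores[int((1 - alpha / 2) * iters) - 1]
    return lo, hi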


2.11 Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language of its Surprise Language Exercise, a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation, and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than with the current automatically inferred rules. Furthermore, a multi-engine version of our system, which combined the output of the Xfer and SMT systems and optimizes translation selection, outperformed both individual systems.

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,
and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated, and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or in short the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE (cf. [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]). We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system, which combined the output of the Xfer and SMT systems and optimizes translation selection, outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig 1. Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules: for example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules: in the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the 'right-hand sides' of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP [V V Aux] -> [Aux being V]
((X1::Y3)
((x1 form) = part)
((x1 aspect) = perf)
((x2 form) = part)
((x2 aspect) = imperf)
((x2 lexwx) = jAnA)
((x3 lexwx) = honA)
((x3 tense) = pres)
((x0 tense) = (x3 tense))
(x0 = x1)
((y1 lex) = be)
((y1 tense) = pres)
((y3 form) = part))

Fig 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally "sent going present-tense-auxiliary") in a sentence such as Ab-tak patr DAk se bheje jAte h~ai ("Letters still are being sent by mail", literally "up-to-now letters mail by sent going present-tense-auxiliary"). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial 'guesses' at transfer rules. The rules that result from Seed Generation are 'flat', in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be rewritten more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
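As a toy illustration of the compositionality stage, the sketch below substitutes the right-hand side of a previously learned constituent rule wherever it occurs inside a flat rule; representing rules as plain POS-tag lists is a simplification of the formalism described above:

def compose(flat_rhs, learned_rules):
    """flat_rhs: list of POS tags, e.g. ['Det', 'Adj', 'N', 'P'].
    learned_rules: dict mapping a constituent label to its known POS
    sequences, e.g. {'NP': [['Det', 'N'], ['Det', 'Adj', 'N']]}."""
    result = list(flat_rhs)
    for label, bodies in learned_rules.items():
        for body in sorted(bodies, key=len, reverse=True):  # longest first
            i = 0
            while i + len(body) <= len(result):
                if result[i:i + len(body)] == body:
                    result[i:i + len(body)] = [label]
                i += 1
    return result

print(compose(["Det", "Adj", "N", "P"],
              {"NP": [["Det", "N"], ["Det", "Adj", "N"]]}))
# -> ['NP', 'P'], i.e. the flat PP rule generalizes to PP -> [NP P]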

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of $p(e \mid f)$, where $f = f_1 \ldots f_J$ is the source sentence and $e = e_1 \ldots e_I$ is the sequence of target language words. According to Bayes' decision rule, we have to search for:

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)    (1)

The language model $p(e)$ describes how well-formed a target language sentence $e$ is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability $p(f \mid e)$ for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase, and the sum over all words in the target phrase. For any possible sequence $s$ of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_{s} p(\tilde{f}_s \mid \tilde{e}_s) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i})    (4)

The search algorithm considers all possible sequences $s$ in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring, compared to the best scoring hypothesis up to that point, are pruned from further consideration.
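To make Equations (1)-(4) concrete, the following sketch scores one candidate sequence of translation units from the lattice; t_table (IBM-1 lexical probabilities) and lm (trigram probabilities) are assumed pre-trained inputs, and smoothing is reduced to a simple probability floor:

from math import prod

def ibm1_phrase_prob(src_phrase, tgt_phrase, t_table):
    """Eq. (3): product over source words of the sum over target words
    of the lexical translation probability t(f_j | e_i)."""
    return prod(
        sum(t_table.get((f, e), 1e-9) for e in tgt_phrase)
        for f in src_phrase
    )

def trigram_prob(words, lm):
    """Eq. (2): p(e) as a product of trigram probabilities; lm maps
    (w_{i-2}, w_{i-1}, w_i) to a probability, with a small floor."""
    padded = ["<s>", "<s>"] + words
    return prod(
        lm.get(tuple(padded[i - 2:i + 1]), 1e-9)
        for i in range(2, len(padded))
    )

def score_path(units, t_table, lm):
    """Eqs. (1) and (4): score one sequence of adjoining, non-overlapping
    translation units as translation model times language model."""
    tm = prod(ibm1_phrase_prob(u["src"], u["tgt"], t_table) for u in units)
    english = [w for u in units for w in u["tgt"]]
    return tm * trigram_prob(english, lm)

# The decoder's search then amounts to maximizing this score over all
# full-coverage paths through the lattice (with optional reordering).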

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.
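The extraction and complexity-sorting of phrases described above can be sketched as follows, using NLTK's bundled sample of the Penn Treebank as a stand-in for the full treebank (this assumes NLTK and its treebank data are installed; the real extraction also sorted phrases by their daughter-node types):

from nltk.corpus import treebank

phrases = []
for tree in treebank.parsed_sents():
    for np in tree.subtrees(filter=lambda t: t.label() == "NP"):
        # height() approximates the parse-tree depth of the phrase
        phrases.append((np.height(), " ".join(np.leaves())))

phrases.sort(key=lambda p: p[0])  # simplest phrases first

# Batch into files of ~200 phrases for distribution to translators
batches = [phrases[i:i + 200] for i in range(0, len(phrases), 200)]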

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher          Xfer
past participle      (tam = yA)       (aspect = perf) (form = part)
present participle   (tam = wA)       (aspect = imperf) (form = part)
infinitive           (tam = nA)       (form = inf)
future               (tam = future)   (tense = fut)
subjunctive          (tam = subj)     (tense = subj)
root                 (tam = 0)        (form = root)

Table I: Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha ("continue, stay, keep"). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
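The feature mapping of Table I amounts to a small lookup table. The sketch below is illustrative; the raw Morpher output format shown is an assumption rather than the exact server output:

# Table I as a mapping from Morpher's tam feature to Xfer features.
TAM_MAP = {
    "yA": [("aspect", "perf"), ("form", "part")],     # past participle
    "wA": [("aspect", "imperf"), ("form", "part")],   # present participle
    "nA": [("form", "inf")],                          # infinitive
    "future": [("tense", "fut")],
    "subj": [("tense", "subj")],
    "0": [("form", "root")],
}

def to_xfer_features(morpher_output):
    """Convert raw Morpher features, e.g. {'tam': 'yA', 'gen': 'm'},
    into the feature pairs used by the Xfer grammar formalism."""
    feats = []
    for key, value in morpher_output.items():
        if key == "tam":
            feats.extend(TAM_MAP.get(value, []))
        else:
            feats.append((key, value))
    return feats

print(to_xfer_features({"tam": "yA", "gen": "m", "num": "sg"}))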

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA ("go", to indicate completion), lenA ("take", to indicate an action done for one's own benefit), or denA ("give", to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2)
(X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2)
(X2::Y1))

Fig 5. Recursive NP Rules

Hindi NP with left recursion: (jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion: (one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig 6. Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)
(X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
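A frequency-based pruning pass of this kind could look roughly like the following; the rule representation (one occurrence recorded per training example that produced the rule) and the threshold are illustrative:

from collections import Counter

def prune_rules(rule_occurrences, min_count=5):
    """rule_occurrences: list of rule identifiers, one per training
    example that produced the rule. Keep rules with enough support and
    attach a relative-frequency score to each survivor."""
    counts = Counter(rule_occurrences)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items() if n >= min_count}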

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word, on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were "best bets" for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense entries. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert, in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2. Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations, without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
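Schematically, the three passes could be sketched as follows; the toy lexicon (keyed by word tuples) and root_of function stand in for the real lexicon and the IIIT Morpher, and grammar-rule application in the second pass is elided:

def match_spans(words, lexicon, min_len=1, max_len=None):
    """Yield lattice arcs (start, end, translation) for every lexicon
    entry, possibly multi-word, found in the input."""
    max_len = max_len or len(words)
    for i in range(len(words)):
        for j in range(i + min_len, min(i + max_len, len(words)) + 1):
            for translation in lexicon.get(tuple(words[i:j]), []):
                yield (i, j, translation)

def build_lattice(words, lexicon, root_of):
    lattice = []
    # Pass 1: fully inflected forms, so phrasal entries can fire.
    lattice += list(match_spans(words, lexicon))
    # Pass 2: root forms from the morphology module; in the real system
    # these feed the grammar rules together with their features.
    roots = [root_of(w) for w in words]
    lattice += list(match_spans(roots, lexicon))
    # Pass 3: original inflected forms, word-to-word only.
    lattice += list(match_spans(words, lexicon, max_len=1))
    return lattice  # a bigger lattice gives the decoder more choices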

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig 8. Example of a Test Sentence and Reference Translations

System                 BLEU               NIST
EBMT                   0.058              4.22
SMT                    0.102 (+/- 0.016)  4.70 (+/- 0.20)
Xfer no grammar        0.109 (+/- 0.015)  5.29 (+/- 0.19)
Xfer learned grammar   0.112 (+/- 0.016)  5.32 (+/- 0.19)
Xfer manual grammar    0.135 (+/- 0.018)  5.59 (+/- 0.20)
MEMT (Xfer + SMT)      0.136 (+/- 0.018)  5.65 (+/- 0.21)

Table II: System Performance Results for the Various Translation Approaches

Fig 9. Results by NIST score: NIST scores for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT, as a function of the decoder reordering window (0-4)

Fig 10. Results by BLEU score: BLEU scores for the same five systems, as a function of the decoder reordering window (0-4)

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding: at any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string, according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:
(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the resulting set of scores.

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

54 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over" point: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences for the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is not statistically very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sheremetyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3 Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

31 Conclusions and Future Work for Hebrew-to-English

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the Hindi, Hebrew, Mapudungun, and Quechua systems contrasted in this paper, we have also worked on Dutch, and we are beginning work on Inupiaq, Aymara, Brazilian Portuguese, and languages native to Brazil.

4 Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People who participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

The Hebrew-to-English research reported here was also made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5 References

Allen, James (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5, no. 1, January-March 2000.

Cusihuaman, Antonio (2001). Gramática Quechua Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. MS thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin (2002). Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya (1997). Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George (2002). Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani (1986). Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell (2002). MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel (1999). Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly (2004). Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona (2003). Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


Table 3 lists the MT systems the AVENUE project developed for each of the four low-resource languages. Within the AVENUE project, the available resources dictate which MT approaches are feasible for a given language pair. Corpus-based approaches to MT were infeasible for the Hebrew-to-English and the Quechua-to-Spanish language pairs because we lacked sentence-aligned bilingual corpora between these languages. Similarly, because the AVENUE team lacked first-hand knowledge of Hindi, only a small hand-written rule-based MT system was developed for Hindi-to-English MT. In addition to resource availability constraints, time constraints have also affected which MT systems we built for each language. Of the two corpus-based approaches to MT, we have so far applied only example-based MT, to the task of Mapudungun-to-Spanish translation. Similarly, we have not yet employed the rule-learning approach to build MT systems for Mapudungun to Spanish or for Quechua to Spanish. Applying the automatic syntactic rule learning method developed by AVENUE to Mapudungun and Quechua is high on our list of next steps.

2 Four MT Systems

Over the past six years, the AVENUE project at the Language Technologies Institute at Carnegie Mellon University has worked to produce basic MT systems for a number of resource-poor languages. Chief among our accomplishments is the construction of MT systems for four languages for which few or no MT systems had previously been built: Mapudungun, Quechua, Hebrew, and Hindi. Mapudungun is the smallest of these four languages: fewer than 1 million people speak Mapudungun in southern Chile and Argentina. Quechua, a second indigenous language of South America, is more widely spoken: there are approximately 10 million speakers of Quechua in Peru, Bolivia, Ecuador, southern Colombia, and northern Argentina. Hebrew is spoken by approximately 7 million people in Israel. And Hindi is the largest of the four languages for which we built MT systems: there are several hundred million native and secondary speakers of Hindi in India and abroad. Despite the large number of Hindi speakers, Hindi is still resource-poor, with only rudimentary natural language processing data and tools freely obtainable.

Building MT systems for resource-poor languages is challenging. The preceding section contrasted the availability (and unavailability) of resources for each of our four resource-poor languages. In the sections that follow we consider each language in turn, specifically illustrating how an omnivorous approach to MT allowed the AVENUE team to overcome the specific resource gaps of each language to build MT systems. Finally, we synthesize the lessons and experience we gained in building these machine translation systems and present overall suggestions for others looking to build MT for languages with scarce computational resources.

21 Two Indigenous South American Languages

The omnivorous approach to MT within the AVENUE project was originally developed to tackle challenges we faced in developing MT for indigenous South American languages under an NSF grant. Omnivorous MT required close collaboration with native speaking populations: the AVENUE project relies on native bilingual speakers to provide crucial information toward building MT systems. After considering several candidate languages, the AVENUE team elected to work with Mapudungun and Quechua, as they promised the closest collaboration with native language organizations. The details of the collaborations with the native organizations are presented in the sections below.

Even with a strong working relationship with native speakers of Mapudungun and Quechua, the AVENUE team faced serious hurdles to creating functional MT for these languages. At the time the AVENUE team started working on Mapudungun, even simple natural language tools such as morphological analyzers or spelling correctors did not exist. In fact, there were few electronic resources from which such natural language tools might be built. There were no standard Mapudungun text or speech corpora or lexicons, and no parsed treebanks. The text that did exist was in a variety of competing orthographic formats. More resources exist for Quechua, but they are still far from what is needed to build a complete MT system.

Mapudungun and Quechua share challenges for MT creation beyond simple resource scarcity, however. These additional challenges stem from both Mapudungun and Quechua being indigenous South American languages.

              |      Corpus-Based MT       |      Rule-Based MT
Language      | Example-Based | Statistical | Hand-Written | Learned
Mapudungun    |       X       |             |      X       |
Quechua       |               |             |      X       |
Hebrew        |               |             |      X       |    X
Hindi         |       X       |      X      |  X (small)   |    X

Table 3: Languages vs. MT systems. Depending on the available resources, the AVENUE project built a variety of MT systems for each language.


First, both Mapudungun and Quechua pose unique challenges from a linguistic theory perspective. Although Mapudungun and Quechua are unrelated languages, they both have complex agglutinative morphological structures. In addition, Mapudungun is polysynthetic, incorporating objects into the verb of a sentence. Agglutination and polysynthesis are both linguistic properties that the majority of languages for which natural language resources have been built do not possess. Second, human factors also pose a particular challenge for these two languages. Specifically, there is a scarcity of people trained in computational linguistics who are native speakers of, or have a good knowledge of, these indigenous languages. And finally, an often overlooked challenge that confronts development of NLP tools for resource-scarce languages is the divergence of the culture of the native speaker population from the western culture of computational linguistics.

Despite the challenges facing the development of natural language processing systems for Mapudungun and Quechua, the AVENUE team has developed a suite of basic language resources for each of these languages and has then leveraged these resources into more sophisticated natural language processing tools. The AVENUE project led a collaborative group of Mapudungun speakers in first collecting, and then transcribing and translating into Spanish, by far the largest spoken corpus of Mapudungun available. From this corpus we then built a spelling checker and a morphological analyzer for Mapudungun. For Quechua we have created parallel and aligned data, as well as a basic bilingual lexicon with morphological information. And with these resources we have built two prototype machine translation (MT) systems for Mapudungun and one prototype MT system for Quechua. The following sections detail the construction of these resources, focusing on overcoming the specific challenges Mapudungun and Quechua each present as resource-scarce languages.

22 Mapudungun

Since May of 2000, in an effort to ultimately produce Mapudungun-Spanish machine translation, computational linguists at CMU's Language Technologies Institute have collaborated with Mapudungun language experts at the Instituto de Estudios Indígenas (IEI - Institute for Indigenous Studies) at the Universidad de la Frontera (UFRO) in Chile, and with the Bilingual and Multicultural Education Program of the Ministry of Education (Mineduc) in Chile (Levin et al 2000).

From the very outset of our collaboration, we battled with the scarcity of electronic resources for Mapudungun. The first phase of the AVENUE collaboration was to collect and produce parallel Mapudungun-Spanish language data from which higher-level language processing tools and systems could be built. One barrier we faced in the collection of Mapudungun language data is that there are currently several competing orthographic conventions for written Mapudungun. Early in the AVENUE Mapudungun collaboration, the IEI-UFRO team established a set of orthographic conventions. All the data collected under the AVENUE project conforms to this set of orthographic conventions; if Mapudungun data was originally in some other orthographic convention, we manually converted it into our own orthography. Recently, however, a different orthography, Azümchefi, has been chosen by the Chilean government for official documents. Portions of our data have been automatically converted into Azümchefi using automatic substitution rules.

221 Corpora and Lexicons

Recognizing the appetite of MT systems for data resources and the scarcity of Mapudungun language data, the AVENUE team began collecting Mapudungun data soon after our collaboration began with IEI and UFRO in 2000. Directed from CMU and conducted by native speakers of Mapudungun at UFRO in Temuco, Chile, our data collection efforts ultimately resulted in three separate corpora: 1) a small parallel Mapudungun-Spanish corpus of historical texts and newspaper text; 2) 1,700 Spanish sentences that were manually translated into Mapudungun and aligned. These 1,700 Spanish sentences were carefully crafted to elicit specific syntactic phenomena in Mapudungun; and 3) a parallel corpus consisting of 170 hours of transcribed and translated Mapudungun speech. Natural language translation is challenging even for humans, particularly when the source text to translate is spoken language, with its many disfluencies and colloquial expressions. The translation of the speech corpus was completed by bilingual speakers of Mapudungun and Spanish, not by professional translators; they attempted to translate as literally as possible. Furthermore, each section of the speech corpus was only ever viewed by a single translator. All of these translation challenges combine to render the Spanish translation of the Mapudungun text imperfect. Still, a bilingual corpus is a valuable resource. Additional details on the corpora are described in Monson et al (2004).

While small, the second corpus listed above, the structural elicitation corpus, has aided our development of the hand-built rule-based MT system for Mapudungun (see below). The larger spoken corpus was put to two specific uses in developing MT for Mapudungun. First, the transcribed and translated spoken corpus was used as a parallel corpus to develop an Example-Based MT system, described further below. Second, the AVENUE team used the spoken corpus to derive a bilingual stem lexicon for use in rule-based MT systems.

To obtain the bilingual stem lexicon, the AVENUE team first converted the transcribed spoken corpus text into a monolingual Mapudungun word list. The 117,003 most frequent fully-inflected Mapudungun word forms were hand-checked for spelling according to the adopted orthographic conventions for Mapudungun, and 15,120 of the most frequent spelling-corrected fully-inflected word forms were hand-segmented into two parts: a first part consisting of the stem of the word, and a second part consisting of one or more suffixes. This produced 5,234 stems and 1,303 suffix groups. The 5,234 stems were then translated into Spanish.
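The first step of this process, deriving a frequency-ranked word list from the transcribed corpus, can be sketched in a few lines of Python. The file name and whitespace tokenization below are hypothetical simplifications, not the actual AVENUE tooling.

from collections import Counter

def frequency_ranked_words(lines):
    """Return the corpus word types ordered from most to least frequent."""
    counts = Counter(token for line in lines for token in line.split())
    return [word for word, _ in counts.most_common()]

# Usage (hypothetical file name):
# words = frequency_ranked_words(open("mapudungun_transcripts.txt", encoding="utf-8"))
# The most frequent forms at the head of this list were then hand-checked for
# spelling and hand-segmented into stem + suffix groups.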

2211 Mapudungun Morphological Analyzer

In addition to consuming data resources, the MT systems used in the AVENUE project consume the computational resource of morphological analysis. Mapudungun is an agglutinative and polysynthetic language. Figure 2 contains glosses of a few morphologically complex Mapudungun verbs that occur in our corpus of spoken Mapudungun, and Monson et al (2004) document the large number of Mapudungun word forms occurring in this speech corpus. A typical complex verb form in our spoken corpus consists of five or six morphemes. Agglutinative morphological processes explode the number of possible surface word forms while simultaneously causing individual surface word forms to be rare. Furthermore, each morpheme may alone contain several morphosyntactic features. Hence it is difficult for an MT system to translate Mapudungun words directly into another language. Analyzing the morphology of Mapudungun allows our MT systems to cover a larger fraction of all Mapudungun surface word forms; by identifying each morpheme in each Mapudungun word and assigning a meaning to each identified morpheme, a rule-based MT system can translate each piece of meaning individually.

We had found few data resources available for Mapudungun, so we were not surprised to learn that no morphological analyzer for Mapudungun had ever been written. We therefore developed our own. Using the stem lexicon we had created and reference grammars of Mapudungun, our morphological analyzer takes a Mapudungun word as input and produces as output all possible morphological analyses of the word. Each analysis specifies:

1) the single stem in that word,

2) each suffix in that word, and

3) a syntactic analysis for the stem and each identified suffix.

To identify the morphemes (stem and suffixes) in a Mapudungun word, a lexicon of stems works together with a fairly complete lexicon of Mapudungun suffixes. The current version of the stem lexicon contains XXX stems and their Spanish translations. Each entry in this lexicon lists the part of speech of the stem, as well as other features associated with the stem, such as lexical aspect in the case of verb stems. The suffix lexicon, built by hand by computational linguists on the AVENUE team, contains all the uncontroversial inflectional morphemes of Mapudungun; there are XXX Mapudungun suffixes in the suffix lexicon. Each suffix lists the part of speech that the suffix attaches to, as well as the linguistic features, such as person, number, and mood, that it marks. The morphological analyzer performs a recursive and exhaustive search over all possible segmentations of a given Mapudungun word (a sketch of this search follows below). The analyzer starts from the beginning of the word and identifies and removes each stem that is an initial string of that word. (The current morphological analyzer does not handle noun incorporation, a fairly rare phenomenon in Mapudungun.) The analyzer then examines the remaining string, looking for a valid combination of suffixes that could complete the word. A suffix combination is valid if:

1) each of the suffixes in the combination can attach to stems with the part of speech of the current stem, and

2) the suffixes occur in a grammatical order.
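A minimal Python sketch of this recursive, exhaustive search is given below. The two toy lexicon entries, the feature labels, and the always-true ordering check are illustrative stand-ins for the actual AVENUE lexicons and suffix-ordering grammar, not the real system.

# Toy stand-in lexicons; entries and features are illustrative only.
STEMS = {"pe": {"pos": "v", "spanish": "ver"}}
SUFFIXES = {
    "fi": {"attaches_to": "v", "feats": {"object": "3rd"}},
    "n":  {"attaches_to": "v", "feats": {"person": "1st", "mood": "indicative"}},
}

def valid_order(suffix_seq):
    # Placeholder for the grammatical-ordering check on a suffix sequence.
    return True

def suffix_splits(rest, pos):
    """Recursively enumerate every way to cover `rest` with known suffixes."""
    if rest == "":
        yield []
        return
    for j in range(1, len(rest) + 1):
        entry = SUFFIXES.get(rest[:j])
        if entry and entry["attaches_to"] == pos:     # suffix must fit the stem's POS
            for tail in suffix_splits(rest[j:], pos):
                yield [rest[:j]] + tail

def analyze(word):
    """Return all (stem, suffixes) segmentations of a Mapudungun word."""
    analyses = []
    for i in range(1, len(word) + 1):                 # every stem that starts the word
        stem = word[:i]
        if stem in STEMS:
            for seq in suffix_splits(word[i:], STEMS[stem]["pos"]):
                if valid_order(seq):
                    analyses.append((stem, seq))
    return analyses

print(analyze("pefin"))   # [('pe', ['fi', 'n'])], roughly "I saw him/her"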

Mapudungun morphology is relatively unambiguous: most Mapudungun words whose stem is in the stem lexicon receive a single analysis, while a few truly ambiguous suffix combinations may cause a Mapudungun word to receive more than one distinct analysis.

Once the morphological analyzer has found all possible segmentations of a word, it combines the feature information from the stem and the suffixes encountered in the analyzed word to create the syntactic analysis that is returned. For an example, see Figure 3.

222 Two Machine Translation Systems

With the basic data and computational resources built for Mapudungun, the AVENUE team could proceed toward constructing machine translation systems from Mapudungun into Spanish. We have developed two prototype Mapudungun-Spanish MT systems. The first is a data-driven system in the Example-Based MT paradigm: an EBMT system can exploit the decent-sized spoken corpus of Mapudungun to produce translations with higher lexical coverage but lower translation accuracy. The second is a hand-built Rule-Based MT system. Building an MT system by hand is feasible only because the AVENUE team has computational linguists available as a resource. The human-developed Rule-Based MT system complements the EBMT system with high-quality translations of specific syntactic constructions over a smaller vocabulary.

2221 An Example-Based Machine Translation System

Example-Based Machine Translation (EBMT) relies on previous translations performed by humans to create new translations without the need for human translators. The previous translations are called the training corpus. For the best translation quality, the training corpus should be as large as possible and as similar as possible to the text to be translated. When the exact sentence to be translated occurs in the training material, translation quality is human-level because the previous translation is re-used. As the source sentence differs more and more from the training material, quality decreases because smaller and smaller fragments must be combined to produce the translation, increasing the chances of an incorrect translation.

As the amount of training material decreases, so does the translation quality, because there are fewer long matches between the training texts and the input to be translated. Conversely, more training data can be added at any time, improving the system's performance by allowing more and longer matches. EBMT usually finds only partial matches, which generate lower-quality translations. Further, EBMT searches for phrases of two or more words, and thus there can be portions of the input which do not produce phrasal translations. For unmatched portions of the input, EBMT falls back on a probabilistic lexicon trained from the corpus to produce word-for-word translations. This fall-back approach provides some translation (even though of lower quality) for any word form encountered in the training data. While the translation quality of an EBMT system can be human-level when long phrases or entire sentences are matched, any mistakes in the human translations used for training (spelling errors, omissions, mistranslations) will become visible in the EBMT system's output. Thus it is important that the training data be as accurate as possible. The training corpus we use for EBMT is the spoken-language corpus described earlier. As discussed above, the corpus of spoken Mapudungun contains some errors and awkward translations.


Highly agglutinative languages like Mapudungun pose a challenge for Example-Based MT. Because there are so many inflected versions of each stem, most inflected words are rare. If a rare word does not occur in the corpus at all, it will not be translatable by EBMT; if it occurs only a few times, it will also be hard for EBMT to gather accurate statistics about how it is used. Additionally, automatic word-level alignment between the two languages, which is used to determine the appropriate translation of a partial match, becomes more difficult when individual words in one language correspond to many words in the other language. We address both issues by using the morphological analyzer for Mapudungun to split words into individual stems and suffixes. Alone, each stem and suffix is more common than the original combination of stem and suffixes, and the separate morphemes are more likely to map to single words in Spanish.
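A minimal sketch of this preprocessing step is shown below, assuming an analyze function like the one sketched in the morphological analyzer section (returning (stem, suffixes) segmentations); the function name and pass-through behavior for unanalyzable words are illustrative assumptions.

def split_for_ebmt(sentence, analyze):
    """Replace each word by its stem and suffixes as separate tokens."""
    tokens = []
    for word in sentence.split():
        analyses = analyze(word)
        if analyses:
            stem, suffixes = analyses[0]   # Mapudungun morphology is rarely ambiguous
            tokens.append(stem)
            tokens.extend(suffixes)
        else:
            tokens.append(word)            # unanalyzable words pass through unchanged
    return " ".join(tokens)

# e.g. split_for_ebmt("pefin", analyze) -> "pe fi n"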

We currently have a prototype EBMT system trained on approximately 204,000 sentence pairs from the corpus of spoken Mapudungun, containing a total of 1.25 million Mapudungun tokens after splitting the original words into stems and one or more suffixes. This separation increased the BLEU score (Papineni et al 2001) on a held-out portion of the speech corpus by 5.48%, from 0.1530 to 0.1614. We expect further improvements in MT accuracy from more detailed use of morphological analysis, from the inclusion of common phrases in the corpus, and from repairing translation errors and awkward translations in the corpus.

2222 A Rule-Based MT System

The second MT system the AVENUE team has developed is a rule-based system where the rules have been built by hand. Manually developing a rule-based MT (RBMT) system involves detailed comparative analysis of the grammars of the source and target languages. Manually imbuing an MT system with detailed syntactic knowledge can produce high-quality translation, but typically takes a significant amount of time to implement. Despite the time requirements, developing a rule-based MT system for Mapudungun by hand is a natural fit in our case for two reasons. First, the AVENUE project has the necessary human expertise available to hand-develop an MT system with significant syntactic understanding. And second, hand-built rule-based MT nicely complements the weaknesses of EBMT: EBMT focuses on high translation coverage at the expense of accuracy, while in contrast the rule-based MT system does not exhaustively cover Mapudungun grammar, but the rules we wrote cover a portion of Mapudungun with high accuracy.

At a high level, the AVENUE rule-based MT framework is composed of three modules: a source language morphological analyzer, the syntactic rule-based transfer system itself, and a target language morphological generator. The source language morphological analyzer processes the raw input source sentence into a lattice of all possible morphological segmentations. The transfer system maps the morphologically analyzed source sentences into target language expressions via a transfer lexicon and grammar. The transfer grammar and lexicon contain syntactic and lexical rules that explicitly refer to morphosyntactic features of the two languages (more details are given in the next subsections). The output of the transfer system is a target language expression composed of uninflected words bundled with grammatical features. It is then the morphological generator's job to turn the uninflected word-feature bundles into fully inflected target language output sentences.
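The three-module data flow can be illustrated with a deliberately tiny sketch. The single-entry lexicons and inflection table below are invented stand-ins for the real AVENUE resources, but the pipeline (analysis, then transfer, then generation) mirrors the description above.

# Single-entry toy resources; the real lexicons and rules are far larger.
ANALYSES = {"pu ruka": [("ruka", {"pos": "n", "number": "pl"})]}  # hypothetical analyzer output
LEXICON = {"ruka": "casa"}        # Mapudungun stem -> uninflected Spanish word
PLURALS = {"casa": "casas"}       # Spanish morphological generation table

def analyze_source(phrase):
    # Module 1: morphological analysis of the Mapudungun input
    return ANALYSES.get(phrase, [])

def transfer(analyses):
    # Module 2: lexical (and, in the real system, structural) transfer
    return [(LEXICON[stem], feats) for stem, feats in analyses if stem in LEXICON]

def generate_target(transferred):
    # Module 3: inflect the Spanish output according to the transferred features
    return [PLURALS[word] if feats.get("number") == "pl" else word
            for word, feats in transferred]

print(generate_target(transfer(analyze_source("pu ruka"))))   # ['casas']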

In the Mapudungun-Spanish rule-based MT system, the source language morphological analyzer is the Mapudungun morphological analyzer described earlier. The Mapudungun-to-Spanish transfer grammar and lexicon, and the Spanish morphological generator, are described in the sections below. But before delving into the details of the Mapudungun-Spanish rule-based MT system, a short overview of the AVENUE rule-based MT framework is needed.

2223 Overview of Transfer Rule Syntax and the Run-time Transfer System

This paper does not describe in detail the rationale behind the transfer rules used in the AVENUE rule-based MT system, or behind the run-time MT transfer engine used to interpret them.

Figure: A simplified hand-written AVENUE-style transfer rule. This rule transfers Mapudungun nominal expressions that mark for number into Spanish nominal expressions. Where Spanish marks number morphologically on the head noun, Mapudungun uses a separate particle to mark number on nouns.

{NBar,1}                          ; 1) Rule identifier
NBar::NBar : [PART N] -> [N]      ; 2) SL and TL constituent structure specification
(
  (X2::Y1)                        ; 3) SL (X2) to TL (Y1) constituent alignment
  ((X1 number) =c pl)             ; 4) SL constraint: this rule applies only if SL number is pl
  ((X0 number) = (X1 number))     ; 4) SL constraint: SL number is passed up to the SL NBar
  ((Y0 number) = (Y1 number))     ; 5) TL constraint: TL number is passed up to the TL NBar
  ((Y0 gender) = (Y1 gender))     ; 5) TL constraint: TL gender is passed up to the TL NBar
  ((Y0 number) = (X0 number))     ; 6) Transfer equation: the TL NBar number must equal the SL NBar number
)


This section, however, does provide a basic understanding of syntactic rules and their interpretation within rule-based MT in the AVENUE framework. For more specifics on the AVENUE rule-based MT formalism and run-time transfer engine, please see Peterson (2002) and Probst (2005).

22231 The AVENUE Transfer Rule Formalism

In rule-based MT, transfer rules decompose the grammatical information contained in a source language expression into a set of grammatical properties such as number, person, tense, lexical meaning, etc. Each rule then builds an equivalent target language expression, copying, modifying, or rearranging grammatical values according to the requirements of the target language syntax and target language lexical repertoire. In the AVENUE system, translation rules have six components (see the figure above): 1) a rule identifier, which consists of a constituent type (Sentence, Noun Phrase, Verb Phrase, etc.) and a unique ID number; 2) a constituent structure specification for both the source language (SL) and the target language (TL); 3) alignments between the SL constituents and the TL constituents; 4) SL constraints, which provide information about features and their values in the SL sentence; 5) TL constraints, which provide information about features and their values in the TL sentence; and 6) transfer equations, which provide information about which feature values transfer from the source into the target language.

The figure above illustrates these six components using a portion of a rule intended for a Mapudungun-to-Spanish rule-based transfer system. In Mapudungun, plurality in nouns is marked in some cases by the prenominal particle pu. The transfer rule in the figure transfers the plural feature from pu in Mapudungun onto Spanish nouns. In this transfer grammar, NBar is the constituent that dominates the noun and its modifiers, but not its determiners. According to this rule, the Mapudungun sequence PART N will be transferred into a noun in Spanish. The constituent alignment aligns the Mapudungun noun (X2) with the Spanish noun (Y1). The SL is checked to ensure the application of the NBar,1 rule only when the Mapudungun particle specifies plural number; if the Mapudungun noun is preceded by any other particle, the rule fails to apply. The number feature is passed up from the particle to the Mapudungun NBar. At the same time, the number that is marked on the Spanish noun is passed up to the Spanish NBar. The gender feature, present only in Spanish, is passed up from the Spanish noun to the Spanish NBar. And finally, a transfer equation ensures that the number on the Spanish NBar equals the number on the Mapudungun NBar. This process can be represented graphically as a tree structure.
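The constraint behavior just described can be illustrated with a toy sketch. The function below is not the AVENUE engine; it hard-codes the NBar,1 rule over plain Python feature dictionaries and simplifies true feature unification to checks and assignments.

def apply_nbar_rule(x1, x2, y1):
    """Apply the NBar,1 rule to feature bundles; return (x0, y0) or None."""
    if x1.get("number") != "pl":        # ((X1 number) =c pl): a constraint, not an assignment
        return None                     # with any other particle, the rule fails to apply
    x0 = {"number": x1["number"]}       # ((X0 number) = (X1 number))
    y0 = {"number": y1.get("number"),   # ((Y0 number) = (Y1 number))
          "gender": y1.get("gender")}   # ((Y0 gender) = (Y1 gender))
    y0["number"] = x0["number"]         # ((Y0 number) = (X0 number)), unification simplified to assignment
    return x0, y0

# The plural particle "pu" licenses the rule; x2 (the Mapudungun noun) is
# aligned with y1 (the Spanish noun) by lexical transfer, e.g. ruka -> casa.
print(apply_nbar_rule({"number": "pl"}, {"lex": "ruka"}, {"lex": "casa", "gender": "f"}))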

We make use of a framework (Probst et al 2002) for rapid prototyping of MT systems for languages with limited amounts of electronically available linguistic resources and corpora, which includes a declarative formalism for symbolic transfer grammars. A grammar consists of a collection of transfer rules, which specify how phrase structures in a source language correspond and transfer to phrase structures in a target language, and the constraints under which these rules should apply. The framework also includes a fully-implemented transfer engine that applies the transfer grammar to a source-language input sentence at runtime and produces all possible word and phrase-level translations according to the grammar. This framework was designed to support research on methods for automatically acquiring transfer grammars from limited amounts of elicited word-aligned data. The framework also supports manual development of transfer grammars by experts familiar with the two languages. While the framework itself is still research work in progress, it is sufficiently well developed for serious experimentation.


22232 The Runtime Transfer Engine

At run time, the transfer engine interprets individual transfer rules like the one in the figure above to convert a morphologically analyzed source language sentence into a target language expression. The output of the run-time system is a lattice of translation alternatives that have yet to be morphologically inflected in the target language. The transfer engine produces a lattice of alternatives, as opposed to a single target language expression, because there may be syntactic and lexical ambiguity in the transfer rules and lexicon: multiple translation choices for source language lexical items in the dictionary, and multiple competing transfer rules, can each increase the number of possible translations (see the next section).

The run-time translation system incorporates into a single module all three of the traditional steps of syntactic transfer-based MT: 1) syntactic parsing of the source language input; 2) transfer of the parsed constituents of the source language to their corresponding structured constituents on the target language side; and 3) generation of the target language expressions from the target language constituents. All three of these processes are performed based on the transfer grammar, the comprehensive set of transfer rules that are loaded into the run-time system. This general framework was previously used for rapid prototyping of an MT system for Hindi-to-English translation (Lavie et al 2003).

In the first stage, source language (SL) parsing is performed based solely on the SL side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in Allen (1995). A chart is populated with all the constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A target language (TL) chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart: the transfer rules associated with each entry in the SL chart determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules provide individual lexical choices for the TL word-level entries in the TL chart. The set of generated TL output expressions that corresponds to the collection of all TL chart entries is then collected into a TL lattice. Finally, the TL lattice is passed to a statistical decoder, which ranks the possible paths through the lattice of translation alternatives. A more detailed description of the runtime transfer-based translation sub-system can be found in Peterson (2002).


22233 Decoding

In the final stage a decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence The trans-lation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond The lattice typically contains translation units of various sizes for different contiguous fragments of input These translation units often overlap The lattice also includes multiple word-to-word (or word-to-phrase) translations reflecting the ambiguity in selec-tion of individual word translations

The task of the decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of the target language string given the source language string The decoder is designed to use a probability model that calculates this probability as a product of two factors a translation model for the translation units and a language model for the target lan-guage The translation model is normally based on a prob-abilistic version of the translation lexicon in which proba-bilities are associated with the different possible transla-tions for each source-language word Such a probability model can be acquired automatically using an reasonable-size parallel corpus in order to train a word-to-word prob-ability model based on automatic word alignment algo-rithms This was currently not possible for Hebrew-to-English since we do not have a suitable parallel corpus at our disposal The decoder for the Hebrew-to-English sys-tem therefore uses the English language model as the sole source of information for selecting among alternative


We use an English trigram language model trained on 160 million words.

The decoding search algorithm considers all possible sequences in the lattice and calculates the language model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
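A minimal sketch of this search follows (in Python, under simplifying assumptions: invented data shapes, a stand-in language model, no reordering, and no proper merging of states that share the same LM history):

def decode(arcs, n, lp):
    """arcs: (start, end, tl_words, tm_logprob) translation units over the
    input; n: input length; lp(word, (w1, w2)) -> trigram log-probability.
    Simplified Viterbi over input positions; a faithful decoder would key
    states on (position, LM history) and allow limited reordering."""
    best = {0: (0.0, ("<s>", "<s>"), [])}       # pos -> (score, history, output)
    for start in range(n):
        if start not in best:
            continue
        score, hist, out = best[start]
        for (s, e, words, tm) in arcs:
            if s != start:
                continue
            sc, h = score + tm, hist
            for w in words:                     # extend the trigram LM score
                sc += lp(w, h)
                h = (h[1], w)
            if e not in best or sc > best[e][0]:
                best[e] = (sc, h, out + list(words))
    return best.get(n)                          # best translation covering all input

lp = lambda w, h: -1.0                          # stand-in language model
print(decode([(0, 1, ["I"], -0.1), (1, 2, ["saw"], -0.2),
              (0, 2, ["I", "see"], -0.5)], 2, lp)[2])   # ['I', 'saw']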

2.2.2.4 Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the manually written rule-based Mapudungun-Spanish MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, we did not neglect the available computational resources for Mapudungun: we developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus, and a version of the elicitation corpus translated into Mapudungun served as a development suite to guide us on which syntactic transfer rules were most vital to write.

The next three sub-sections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages; manually writing transfer rules forced us to consciously overcome them. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. And the third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor of the kind of transfer rules that are needed to write a transfer grammar from Mapudungun to Spanish, two completely unrelated languages. A more in-depth discussion of the transfer grammar and of translation mismatches between Mapudungun and Spanish is presented in a forthcoming paper.

2.2.2.4.1 Suffix Agglutination

As discussed earlier, Mapudungun is an agglutinative language, often with four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled VSuffG,1 in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second VSuffG rule, labeled VSuffG,2 in Figure 6, combines an additional VSuff recursively with a VSuffG, passing up the feature structures of both children to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to and generation of Spanish expressions.

These two VSuffG rules apply successfully to combine the two verbal suffixes -fi and -ñ in example (1) into a single VSuffG. First the rule VSuffG,1 is applied to -fi, producing a VSuffG marked with third person object agreement. Next the rule VSuffG,2 recursively combines -ñ with the VSuffG containing -fi into a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verb Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1) pe -fi -ñ
        see -3.obj -1.sg.subj.indic
    'I saw him/her/them/it'

{VSuffG,1}
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

{VSuffG,2}
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
 (X0 = X2))

Figure 6: Two recursive Verbal Suffix Group Rules
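The recursive feature accumulation that these two rules perform can be pictured with a small sketch (hypothetical feature dictionaries stand in for the engine's feature structures; true unification would fail on conflicting values rather than overwrite):

def vsuffg(suffixes):
    """Fold a sequence of verbal-suffix feature sets into one VSuffG."""
    group = dict(suffixes[0])          # VSuffG,1: X0 = X1
    for suff in suffixes[1:]:          # VSuffG,2: X0 = X1, X0 = X2, recursively
        group.update(suff)             # accumulate all grammatical features
    return group

fi = {"object": "3"}                              # -fi: third person object
ny = {"subject": "1sg", "mood": "indicative"}     # -ñ: subject agreement, mood
print(vsuffg([fi, ny]))
# {'object': '3', 'subject': '1sg', 'mood': 'indicative'}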



2.2.2.4.2 Tense

Tense in Mapudungun is often morphologically unmarked. The temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun; consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.

2.2.2.4.3 Typological Divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are grammaticalized using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation, but the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle, as in (3).

(2) Mapudungun: kofke-tu-n
        bread-verbalizer-1.sg.subj.indic
    Spanish: com-o pan
        eat-1.sg.pres.indic bread
    English: 'I eat bread'

(3) Mapudungun: pe-nge-n
        see-passive-1.sg.subj.indic
    Spanish: fui visto
        Aux.1.sg.past.indic see.passPart
    English: 'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side.

Figure 7: Abstract rules for the interpretation of Tense in Mapudungun

Grammatical Features     Lexical    Temporal         Example
Tense      Aspect        Aspect     Interpretation
Future     Any           Any        Future           pe-a-n (see-future-1.sg.subj.indic) 'I will see'
Unmarked   Any           Stative    Present          niye-n (own-1.sg.subj.indic) 'I own'
Unmarked   Habitual      Unmarked   Present          kellu-ke-n (help-habitual-1.sg.subj.indic) 'I help'
Unmarked   Unmarked      Unmarked   Past             kellu-n (help-1.sg.subj.indic) 'I helped'

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
((X1::Y1)                              ; alignment
 ((X2 tense) = UNDEFINED)              ; SL constraint on morphological tense
 ((X1 lexicalaspect) = UNDEFINED)      ; SL constraint on the verb's aspectual class
 ((X2 aspect) = (NOT habitual))        ; SL constraint on grammatical aspect
 ((X0 tense) = past)                   ; SL tense feature assignment
 ...)

Figure 8: Past tense rule (transfer of features omitted)
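Read procedurally, the tense rules of Figure 7 amount to the following decision logic (a sketch only; the feature names are assumptions, and the grammar expresses this declaratively in VBar rules such as the one in Figure 8 rather than as code):

def temporal_interpretation(tense, aspect, lexical_aspect):
    """None stands for 'unmarked'; returns the tense to realize in Spanish."""
    if tense == "future":
        return "future"                    # pe-a-n 'I will see'
    if lexical_aspect == "stative":
        return "present"                   # niye-n 'I own'
    if aspect == "habitual":
        return "present"                   # kellu-ke-n 'I help'
    return "past"                          # kellu-n 'I helped'

assert temporal_interpretation("future", None, None) == "future"
assert temporal_interpretation(None, None, "stative") == "present"
assert temporal_interpretation(None, "habitual", None) == "present"
assert temporal_interpretation(None, None, None) == "past"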


The rule for passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary the correct grammatical features, and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

2.2.2.5 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona under a research license (http://www.lsi.upc.es/~nlp/freeling/parole-es.html). Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that the inflected form marks. We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute and value pairs of the format that our MT system expects. This way, as the run-time transfer engine needs a particular inflected form of a Spanish word, the transfer engine can pass the uninflected form to the Spanish Morphology Generator and have it generate the appropriate surface inflected form.
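The effect is that generation reduces to a lookup keyed on a stem plus a feature bundle, roughly as in this sketch (the UPC dictionary's actual entry format differs, and these example entries are illustrative):

# Inflected forms indexed by (stem, grammatical feature values).
FORMS = {
    ("comer", ("pres", "indic", "1", "sg")): "como",
    ("estar", ("past", "indic", "1", "sg")): "estuve",
    ("casa", ("f", "pl")): "casas",
}

def generate(stem, feats):
    """Return the surface form for a Spanish stem plus feature values."""
    return FORMS.get((stem, tuple(feats)), stem)   # fall back to the bare stem

print(generate("estar", ["past", "indic", "1", "sg"]))   # estuve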

When the Spanish morphological generation is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN))))))>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA)))))))>

tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO)))))))>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA)) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN))))))>

2.2.3 Evaluation

Things that need to go in here:
- Hand evaluation
- BLEU, NIST, and Meteor scores
- Can we evaluate the EBMT system as well on the same test data?
- Statistical significance tests of the results
- Automatic evaluation of a word-to-word translator

2.3 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals for building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language.

{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]       ; insertion of aux on the Spanish side
((X1::Y2)                              ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
 ((X2 voice) =c passive)               ; Mapudungun voice constraint
 ((Y1 person) = (Y0 person))           ; passing person features to aux in Spanish
 ((Y1 number) = (Y0 number))           ; passing number features to aux in Spanish
 ((Y1 mood) = (Y0 mood))               ; passing mood features to aux in Spanish
 ((Y2 number) =c (Y1 number))          ; Spanish-side agreement constraint
 ((Y1 tense) = past)                   ; assigning tense feature to aux in Spanish
 ((Y1 form) =c ser)                    ; Spanish auxiliary selection
 ((Y2 mood) = part)                    ; Spanish-side verb form constraint
 ...)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, Mapudungun, a non-Indo-European language, marks passive voice with a verbal suffix. Hence an auxiliary verb must be inserted on the Spanish side. The auxiliary verb is the first V constituent on the production side of the transfer rule. This rule then ensures that the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.


The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing information from non-expert bilingual speakers (Font Llitjós et al. 2005a).

For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker and were later manually revised. Taking the Elicitation Corpus (section N) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough to use for development of MT. The EC resembles a fieldwork questionnaire containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features such as number, person, tense, and gender. The version that was used in Peru had 1700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al. 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books which had parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section 2.2.2.5).

2.3.2.1 Morphology and Translation Lexicons

In order to build a translation and morphology lexicon, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books; only 3,002 word types appeared in more than one book.1 Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique

1 This was done before the OCR correction was completed and thus this list contained OCR errors

word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

These 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10.

1. Word Segmentation
2. Root Translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the Final Root

Figure 10: Fields required to automatically build the translation and morphology lexicons

The native speaker was asked to provide the Translation of the Final Root (last field in Figure 10) only if there had been a Part-of-Speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old" and the word is a verb whose root really means "to get old" ("machuyay").2 Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently in Spanish. For these cases the native speaker was asked to use || instead of |, and the post-processing scripts were designed to check for the consistency of || in both the translation and the POS fields.

2 -ya- is a verbalizer in Quechua
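The expansion from one annotated row into lexical entries is essentially a cross product over the '|'-separated alternatives, as in this sketch (rule format simplified; the '||' pairing of POS with translation is left out):

def cross(pos_field, tr_field):
    """All (POS, translation) pairs from '|'-separated alternatives."""
    return [(p.strip(), t.strip())
            for p in pos_field.split("|")
            for t in tr_field.split("|")]

root = "chay"
for pos, tr in cross("Pron | Adj", "ese | esa | eso"):
    print(f"{pos}::{pos} |: [{root}] -> [{tr}]  ((X1::Y1))")
# yields the six entries Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso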


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word: chayqa
Segmentation: chay+qa
Root translation: ese | esa | eso
Root POS: Pron | Adj
Word Translation: ese || es ese
Word POS: Pron || VP

Figure 11: Example of a segmented and translated word

V::V |: [ni] -> [decir]
((X1::Y1))

N::N |: [pacha] -> [tiempo]
((X1::Y1))

N::N |: [pacha] -> [tierra]
((X1::Y1))

Pron::Pron |: [noqa] -> [yo]
((X1::Y1))

Interj::Interj |: [alli] -> [a pesar]
((X1::Y1))

Adj::Adj |: [hatun] -> [grande]
((X1::Y1))

Adj::Adj |: [hatun] -> [alto]
((X1::Y1))

Adv::Adv |: [kunan] -> [ahora]
((X1::Y1))

Adv::Adv |: [allin] -> [bien]
((X1::Y1))

Adv::Adv |: [ama] -> [no]
((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list

Suff::Suff |: [s] -> []          ; 'dicen que' on the Spanish side
((X1::Y1)
 ((x0 type) = reportative))

Suff::Suff |: [si] -> []         ; when following a consonant
((X1::Y1)
 ((x0 type) = reportative))

Suff::Suff |: [qa] -> []
((X1::Y1)
 ((x0 type) = emph))

Suff::Suff |: [chu] -> []
((X1::Y1)
 ((x0 type) = interr))

VSuff::VSuff |: [nki] -> []
((X1::Y1)
 ((x0 person) = 2)
 ((x0 number) = sg)
 ((x0 mood) = ind)
 ((x0 tense) = pres)
 ((x0 inflected) = +))

NSuff::NSuff |: [kuna] -> []
((X1::Y1)
 ((x0 number) = pl))

NSuff::Prep |: [manta] -> [de]
((X1::Y1)
 ((x0 form) = manta))

Figure 13: Manually written suffix lexical entries

For the current working MT prototype, the suffix lexicon has 36 entries; Cusihuamán's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in subsection 2.2.2.3.1 above, contains 25 rules and covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in section 2.2.2.5.

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and, when necessary, corrected nine machine translations. The correction of MT output is the first step towards automatic rule refinement and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.


{S,2}
S::S : [NP VP] -> [NP VP]
((X1::Y1)
 (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender)))

{SBar,1}
SBar::SBar : [S] -> ["Dice que" S]
((X1::Y2)
 ((x1 type) =c reportative))

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
((X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood)))

Figure 14: Manually written grammar rules for Quechua-Spanish translation

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO)))))>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ)))))))>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO)))) (VP3 (VBAR2 (V35 SOY)) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA)))))))>


Figure 15: MT output from the Quechua-Spanish Rule-Based MT prototype

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of (typically) three or more consonants, and patterns are sequences of vowels and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona 2003), and all existing significant corpora are monolingual. Hence the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,3 not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"); prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"); subordinating conjunctions such as $ ("that") and K$ ("when"); relativizers such as $ ("that"); and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems.

3 To facilitate readability we use a transliteration of Hebrew using ASCII characters in this paper

Thus a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or a prefixing particle.

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule based, relying on the AVENUE run-time transfer engine described earlier to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. The small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al. 2002). For both of these systems we used existing publicly available resources, which we adapted in novel ways for the MT task, to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation segments that together represent the most likely translation of the entire input sentence.


We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
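A sketch of this mapping (the transliteration table covers only the letters of the running example, and the particular letter assignments here are assumptions rather than the system's actual scheme):

# Hebrew letters (already decoded from the Windows encoding, e.g. via
# bytes.decode("cp1255")) mapped to the internal ASCII representation.
HEB_TO_ASCII = {"\u05d1": "B", "\u05d4": "H", "\u05d5": "W",
                "\u05e8": "R", "\u05e9": "$"}   # bet, he, vav, resh, shin

def romanize(text):
    """Map Hebrew characters to the romanized internal representation."""
    return "".join(HEB_TO_ASCII.get(ch, ch) for ch in text)

print(romanize("\u05d1\u05e9\u05d5\u05e8\u05d4"))   # B$WRH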

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is thus available, it is available to us only as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which are all stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2 Lattice Representation of a set of Analyses for the Hebrew Word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3 Feature-Structure Representation of a set of Analyses for the Hebrew Word B$WRH
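Enumerating the paths of the lattice makes the representation concrete. The sketch below hard-codes the arcs of Figure 3 (features omitted) and recovers the lexeme sequences of Figure 2, plus the duplicate full-span path contributed by the unknown-word analysis Y7:

ARCS = [
    (0, 4, "B$WRH"),   # Y0: noun, 'gospel'
    (0, 2, "B"),       # Y1: preposition 'in'
    (1, 3, "$WR"),     # Y2: noun, 'bull'
    (3, 4, "$LH"),     # Y3: possessive 'her'
    (0, 1, "B"),       # Y4: preposition 'in'
    (1, 2, "H"),       # Y5: hidden definite article
    (2, 4, "$WRH"),    # Y6: noun, 'line'
    (0, 4, "B$WRH"),   # Y7: unknown-word fallback, no features
]

def paths(start=0, end=4):
    """Enumerate every lexeme sequence spanning the whole word."""
    if start == end:
        return [[]]
    return [[lex] + rest
            for (s, e, lex) in ARCS if s == start
            for rest in paths(e, end)]

print(paths())
# [['B$WRH'], ['B', '$WRH'], ['B', '$WR', '$LH'], ['B', 'H', '$WRH'], ['B$WRH']]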

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful.


For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

How English generation differs from Spanish in our system(s):

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
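A sketch of the enhancement and of the run-time selection (entry format simplified; 'unification' here is plain compatibility checking):

def enhance_noun(hebrew, english_sg, english_pl):
    """One Hebrew base form -> two constrained English entries."""
    return [
        {"sl": hebrew, "tl": english_sg, "constraints": {"num": "sg"}},
        {"sl": hebrew, "tl": english_pl, "constraints": {"num": "pl"}},
    ]

def select(entries, analyzed):
    """Keep entries whose constraints unify with the analyzer's features."""
    return [e["tl"] for e in entries
            if all(analyzed.get(k) == v for k, v in e["constraints"].items())]

entries = enhance_noun("$WRH", "line", "lines")
print(select(entries, {"num": "pl"}))   # ['lines']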

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
;; SL: $MLH ADWMH
;; TL: A RED DRESS
;; Score: 2
NP1::NP1 : [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
;; SL: H $MLWT H ADWMWT
;; TL: THE RED DRESSES
;; Score: 4
NP1::NP1 : [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Figure 4 NP Transfer Rules for Nouns Modified by Adjectives from Hebrew to English

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al. 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in section 2.10.

As part of our two-month effort, we developed a preliminary small manual transfer grammar which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Figure 5: Select Translated Sentences from the Development Data


System           BLEU                      NIST                      Precision  Recall
No Grammar       0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830     0.4153
Learned Grammar  0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938     0.4219
Manual Grammar   0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085     0.4241

Table 1: System Performance Results with the Various Grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al. 2002) and NIST (Doddington 2002) automatic metrics for MT evaluation. We also include aggregate unigram precision and unigram recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual grammar system outperforms the no-grammar system according to all the metrics. The learned grammar system scores fall between the scores of the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
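The bootstrapping procedure is compact enough to sketch (metric stands in for BLEU or NIST computed over a list of test segments; this is the standard percentile bootstrap, not necessarily the exact variant used):

import random

def bootstrap_ci(segments, metric, n=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for a corpus-level metric."""
    scores = sorted(metric([random.choice(segments) for _ in segments])
                    for _ in range(n))
    return scores[int(n * alpha / 2)], scores[int(n * (1 - alpha / 2)) - 1]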


2.11 Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language for which to build MT systems in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,


and have significantly improved the state of the art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a limited-data situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1: Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation princi-

ples from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender and number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL):

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP [V V Aux] -> [Aux being V]
((X1::Y3)
((x1 form) = part)
((x1 aspect) = perf)
((x2 form) = part)
((x2 aspect) = imperf)
((x2 lexwx) = jAnA)
((x3 lexwx) = honA)
((x3 tense) = pres)
((x0 tense) = (x3 tense))
(x0 = x1)
((y1 lex) = be)
((y1 tense) = pres)
((y3 form) = part))

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat" in that they specify a sequence of parts of speech, and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as Det N P and Det Adj N P can be re-written more generally as NP P, as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only

for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules, by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
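To make the formalism concrete, the following minimal sketch (in Python, and not the actual AVENUE implementation) shows one way a rule such as the one in Figure 3 could be represented, and how its x-side feature constraints could be checked against morphologically analyzed input. All class, feature, and root names here are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class TransferRule:
    rule_type: str                     # e.g. "VP" -- the constituent type
    x_side: list                       # SL right-hand side, e.g. ["V", "V", "Aux"]
    y_side: list                       # TL right-hand side, e.g. ["Aux", "being", "V"]
    alignments: list = field(default_factory=list)    # e.g. [(1, 3)] for X1::Y3
    x_constraints: dict = field(default_factory=dict) # {(index, feature): value}

def x_side_matches(rule, sl_constituents):
    """Check that SL constituents satisfy the rule's x-side constraints.

    sl_constituents is a list of feature dicts, one per x-side element,
    as produced by morphological analysis (e.g. {"form": "part", ...}).
    """
    if len(sl_constituents) != len(rule.x_side):
        return False
    for (index, feature), required in rule.x_constraints.items():
        actual = sl_constituents[index - 1].get(feature)  # 1-based indices
        if actual is not None and actual != required:
            return False        # feature clash: unification fails
        # unspecified features unify freely in this simplified check
    return True

# The passive simple-present rule of Figure 3, transcribed by hand:
rule = TransferRule(
    rule_type="VP",
    x_side=["V", "V", "Aux"],
    y_side=["Aux", "being", "V"],
    alignments=[(1, 3)],
    x_constraints={
        (1, "form"): "part", (1, "aspect"): "perf",
        (2, "form"): "part", (2, "aspect"): "imperf", (2, "lexwx"): "jAnA",
        (3, "lexwx"): "honA", (3, "tense"): "pres",
    },
)

analyzed = [
    {"form": "part", "aspect": "perf"},            # bheje (illustrative)
    {"form": "part", "aspect": "imperf", "lexwx": "jAnA"},  # jAte
    {"lexwx": "honA", "tense": "pres"},            # h~ai
]
print(x_side_matches(rule, analyzed))   # True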

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].
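The following is a highly simplified sketch of this integrated parse-and-transfer pass, assuming binary context-free rules and ignoring feature constraints and ambiguity management. The rule and lexicon encodings are invented for illustration and do not reflect the actual transfer engine: a rule is (type, x_side, y_order), where y_order gives, for each y-side slot, the 1-based index of the x-side child it realizes.

def parse_and_transfer(words, lexicon, rules):
    # chart[(i, j)] holds (category, tl_string) analyses for words[i:j]
    chart = {}
    n = len(words)
    for i, w in enumerate(words):                  # seed with lexical transfer
        for cat, tl in lexicon.get(w, []):
            chart.setdefault((i, i + 1), []).append((cat, tl))
    for span in range(2, n + 1):                   # bottom-up over span sizes
        for i in range(0, n - span + 1):
            j = i + span
            for cat, x_side, y_side in rules:
                if len(x_side) != 2:               # binary rules only, for brevity
                    continue
                for k in range(i + 1, j):
                    for c1, t1 in chart.get((i, k), []):
                        for c2, t2 in chart.get((k, j), []):
                            if [c1, c2] == x_side:
                                kids = [t1, t2]
                                tl = " ".join(
                                    kids[y - 1] if isinstance(y, int) else y
                                    for y in y_side)
                                chart.setdefault((i, j), []).append((cat, tl))
    return chart                                   # the TL lattice entries

# Toy Hindi NP -> English NP, flipping the postposition phrase (cf. Fig. 5):
lexicon = {"jIvana": [("NP", "life")], "ke": [("Postp", "of")],
           "aXyAya": [("NP", "chapter")]}
rules = [("PP", ["NP", "Postp"], [2, 1]),          # [NP Postp] -> [Prep NP]
         ("NP", ["PP", "NP"], [2, 1])]             # [PP NP] -> [NP PP]
chart = parse_and_transfer(["jIvana", "ke", "aXyAya"], lexicon, rules)
print(chart[(0, 3)])                               # [('NP', 'chapter of life')]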

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for

ê = argmax_e p(e|f) = argmax_e p(f|e) p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = ∏_{i=1}^{I} p(e_i | e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(f̃ | ẽ) = ∏_j Σ_i p(f_j | e_i)    (3)

where the product runs over all words in the source phrase, and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f | e) = ∏_s p(f̃ | ẽ) = ∏_s ∏_j Σ_i p(f_{s_j} | e_{s_i})    (4)

The search algorithm considers all possible sequences s in the lattice, and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e. leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e. longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring, compared to the best scoring hypothesis up to that point, are pruned from further consideration.
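A minimal sketch of this decoding step is given below, assuming a monotone (no-reordering) search for brevity: translation units are scored with the IBM-1 phrase probability of equation (3) and combined with a language model score by dynamic programming over lattice end positions. The toy language model and probability values are placeholders, not trained values, and the function names are our own.

import math

def ibm1_phrase_prob(f_words, e_words, t):
    """Equation (3): p(f~|e~) = prod_j sum_i t[(f_j, e_i)]."""
    prob = 1.0
    for f in f_words:
        prob *= sum(t.get((f, e), 1e-10) for e in e_words)
    return prob

def decode(lattice, n, lm_logprob):
    """Pick the best sequence of adjoining, non-overlapping translation
    units covering source positions 0..n (equations (1) and (4)); the
    real decoder additionally reorders units within a small window.
    lattice maps (start, end) -> list of (e_words, tm_prob)."""
    best = {0: (0.0, [])}                   # position -> (logscore, output)
    for end in range(1, n + 1):
        for start in range(end):
            if start not in best:
                continue
            for e_words, tm_prob in lattice.get((start, end), []):
                score = (best[start][0] + math.log(tm_prob)
                         + lm_logprob(best[start][1], e_words))
                if end not in best or score > best[end][0]:
                    best[end] = (score, best[start][1] + e_words)
    return best.get(n)

# A toy uniform language model stand-in; a real system uses a trigram LM.
def lm_logprob(history, e_words):
    return -1.0 * len(e_words)

lattice = {(0, 1): [(["letters"], 0.8)],
           (1, 3): [(["are", "being", "sent"], 0.5)],
           (0, 3): [(["letters", "sent"], 0.01)]}
print(decode(lattice, 3, lm_logprob))
# (-4.916..., ['letters', 'are', 'being', 'sent'])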

3. ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to insure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural, even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4. HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher         Xfer
past participle      (tam = yA)      (aspect = perf) (form = part)
present participle   (tam = wA)      (aspect = imperf) (form = part)
infinitive           (tam = nA)      (form = inf)
future               (tam = future)  (tense = fut)
subjunctive          (tam = subj)    (tense = subj)
root                 (tam = 0)       (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
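The mapping in Table I is simple enough to sketch directly. The following illustrative snippet (not the actual wrapper code) shows how raw Morpher features might be rewritten into the Xfer feature system; the function name and the pass-through handling of other features are assumptions.

# Table I as a lookup from Morpher tense/aspect/mood (tam) values to
# the feature structures used by the Xfer grammar formalism.
TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},   # past participle
    "wA":     {"aspect": "imperf", "form": "part"},   # present participle
    "nA":     {"form": "inf"},                        # infinitive
    "future": {"tense": "fut"},                       # future
    "subj":   {"tense": "subj"},                      # subjunctive
    "0":      {"form": "root"},                       # root
}

def morpher_to_xfer(morpher_features):
    """Translate raw Morpher output features into Xfer grammar features."""
    xfer = {}
    for name, value in morpher_features.items():
        if name == "tam":
            xfer.update(TAM_TO_XFER.get(value, {}))
        else:
            xfer[name] = value        # gender, number, etc. pass through
    return xfer

print(morpher_to_xfer({"root": "jAnA", "tam": "wA", "num": "pl"}))
# {'root': 'jAnA', 'aspect': 'imperf', 'form': 'part', 'num': 'pl'}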

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed, and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion: (jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion: (one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e. how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
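The frequency-based scoring and pruning just described can be sketched as follows; the rule names and the threshold value are arbitrary illustrations, since the exact cutoff used is not specified above.

from collections import Counter

def score_and_prune(rule_instances, threshold=0.02):
    """rule_instances: one entry per training example, naming the rule
    that example produced. Returns {rule: probability} above threshold."""
    counts = Counter(rule_instances)
    total = sum(counts.values())
    probs = {rule: c / total for rule, c in counts.items()}
    return {rule: p for rule, p in probs.items() if p >= threshold}

instances = ["NP:adj-n"] * 60 + ["PP:np-postp"] * 35 + ["NP:spurious"] * 5
print(score_and_prune(instances, threshold=0.10))
# {'NP:adj-n': 0.6, 'PP:np-postp': 0.35}   (the spurious rule is pruned)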

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words, and leaving only those that were "best bets" for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged, and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule, indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense entries. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert, in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.
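As an illustration of the dictionary enhancement step described above, the sketch below derives a plural English entry, with its number constraint, from a root-form noun entry using simple spelling rules. The Hindi root shown and the exact entry format are assumptions for the example; irregular forms would come from a hand-made exception list (not shown).

def pluralize(noun):
    """English plural by spelling rule; a real system adds exceptions."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                     # add 'es' after sibilants
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"               # city -> cities
    return noun + "s"

def enhance_noun_entry(hindi_root, english_root):
    """Return the original entry plus a plural entry with a constraint."""
    return [
        {"sl": hindi_root, "tl": english_root, "constraints": {}},
        {"sl": hindi_root, "tl": pluralize(english_root),
         "constraints": {"num": "pl"}},        # only unifies with plural input
    ]

for entry in enhance_noun_entry("patra", "letter"):   # "patra": illustrative
    print(entry)
# {'sl': 'patra', 'tl': 'letter', 'constraints': {}}
# {'sl': 'patra', 'tl': 'letters', 'constraints': {'num': 'pl'}}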

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans, and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the words of the Hindi sentence against the lexicon in their full forms, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module, to obtain the root forms along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
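The three passes can be summarized in the following sketch. The helper interfaces (the grammar's apply method and the analyzer function) are hypothetical stand-ins for the components described above, and the stub demo grammar adds no rules.

def build_lattice(sentence, phrasal_lexicon, lexicon, grammar, analyze):
    lattice = []
    words = sentence.split()

    # Pass 1: full-form (phrasal) lexicon matches, no grammar rules.
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            phrase = " ".join(words[i:j])
            for translation in phrasal_lexicon.get(phrase, []):
                lattice.append((i, j, translation))

    # Pass 2: morphological roots fed to the grammar rules.
    analyses = [analyze(w) for w in words]           # root + features
    for (i, j, translation) in grammar.apply(analyses):
        lattice.append((i, j, translation))

    # Pass 3: word-to-word matches on the original inflected forms.
    for i, w in enumerate(words):
        for translation in lexicon.get(w, []):
            lattice.append((i, i + 1, translation))

    return lattice    # rescored later by the statistical decoder

class NoGrammar:                        # stub: a grammar that adds nothing
    def apply(self, analyses):
        return []

phrasal = {"bheje jAte h~ai": ["are being sent"]}
lex = {"patr": ["letter"], "bheje": ["sent"]}
print(build_lattice("patr bheje jAte h~ai", phrasal, lex, NoGrammar(),
                    lambda w: {"root": w}))
# [(1, 4, 'are being sent'), (0, 1, 'letter'), (1, 2, 'sent')]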

5. THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen, Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                 BLEU                 NIST
EBMT                   0.058                4.22
SMT                    0.102 (+/- 0.016)    4.70 (+/- 0.20)
Xfer no grammar        0.109 (+/- 0.015)    5.29 (+/- 0.19)
Xfer learned grammar   0.112 (+/- 0.016)    5.32 (+/- 0.19)
Xfer manual grammar    0.135 (+/- 0.018)    5.59 (+/- 0.20)
MEMT (Xfer + SMT)      0.136 (+/- 0.018)    5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

[Line graph: NIST score as a function of the decoder reordering window (0-4), for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig. 9. Results by NIST score

[Line graph: BLEU score as a function of the decoder reordering window (0-4), for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output, if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e. only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology);
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3;
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
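This bootstrap resampling procedure is straightforward to sketch. The following illustrative code draws the bootstrap samples and reads off the median and the 95% interval bounds, with a toy scoring function standing in for BLEU or NIST; the function and parameter names are our own.

import random

def bootstrap_ci(test_set, score_fn, draws=10000, seed=0):
    """Draw len(test_set) items with replacement, rescore, repeat."""
    rng = random.Random(seed)
    scores = []
    for _ in range(draws):
        sample = [rng.choice(test_set) for _ in range(len(test_set))]
        scores.append(score_fn(sample))
    scores.sort()
    return (scores[len(scores) // 2],             # median
            scores[int(0.025 * len(scores))],     # lower 95% bound
            scores[int(0.975 * len(scores))])     # upper 95% bound

# Toy demo: 258 per-sentence "scores", averaged as the set-level score.
toy_test_set = [0.4, 0.5, 0.6, 0.7] * 64 + [0.5, 0.9]
mean = lambda xs: sum(xs) / len(xs)
print(bootstrap_ci(toy_test_set, mean, draws=1000))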

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios, where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor; a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice, and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6. CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India that conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool, to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond Hindi, Hebrew, Mapudungun, and Quechua, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese and other languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullán Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

The Hebrew research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio (2001). Gramática Quechua Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. In Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullán (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. In Proceedings of the MT2010 Workshop at MT Summit VIII.


Quechua being indigenous South American languages. First, both Mapudungun and Quechua pose unique challenges from a linguistic theory perspective. Although Mapudungun and Quechua are unrelated languages, they both have complex agglutinative morphological structures. In addition, Mapudungun is polysynthetic, incorporating objects into the verb of a sentence. Agglutination and polysynthesis are both linguistic properties that the majority of languages for which natural language resources have been built do not possess. Second, human factors also pose a particular challenge for these two languages. Specifically, there is a scarcity of people trained in computational linguistics who are native speakers of, or who have a good knowledge of, these indigenous languages. And finally, an often overlooked challenge that confronts development of NLP tools for resource-scarce languages is the divergence of the culture of the native speaker population from the western culture of computational linguistics.

Despite the challenges facing the development of natural language processing systems for Mapudungun and Quechua, the AVENUE team has developed a suite of basic language resources for each of these languages, and has then leveraged these resources into more sophisticated natural language processing tools. The AVENUE project led a collaborative group of Mapudungun speakers in first collecting and then transcribing and translating into Spanish by far the largest spoken corpus of Mapudungun available. From this corpus we then built a spelling checker and a morphological analyzer for Mapudungun. For Quechua, we have created parallel and aligned data, as well as a basic bilingual lexicon with morphological information. And with these resources we have built two prototype machine translation (MT) systems for Mapudungun and one prototype MT system for Quechua. The following sections detail the construction of these resources, focusing on overcoming the specific challenges Mapudungun and Quechua each present as resource-scarce languages.

22 Mapudungun

Since May of 2000, in an effort to ultimately produce Mapudungun-Spanish machine translation, computational linguists at CMU's Language Technologies Institute have collaborated with Mapudungun language experts at the Instituto de Estudios Indígenas (IEI - Institute for Indigenous Studies) at the Universidad de la Frontera (UFRO) in Chile, and with the Bilingual and Multicultural Education Program of the Ministry of Education (Mineduc) in Chile (Levin et al., 2000).

From the very outset of our collaboration, we battled with the scarcity of electronic resources for Mapudungun. The first phase of the AVENUE collaboration was to collect and produce parallel Mapudungun-Spanish language data from which higher-level language processing tools and systems could be built. One barrier we faced in the collection of Mapudungun language data is that there are currently several competing orthographic conventions for written Mapudungun. Early in the AVENUE Mapudungun collaboration, the IEI-UFRO team established a set of orthographic conventions. All the data collected under the AVENUE project conforms to this set of orthographic conventions; if the Mapudungun data was originally in some other orthographic convention, then we manually converted it into our own orthography. Recently, however, a different orthography, Azümchefi, has been chosen by the Chilean government for official documents. Portions of our data have been automatically converted into Azümchefi using automatic substitution rules.

221 Corpora and Lexicons

Recognizing the appetite of MT systems for data resources, and the scarcity of Mapudungun language data, the AVENUE team began collecting Mapudungun data soon after our collaboration began with IEI and UFRO in 2000. Directed from CMU and conducted by native speakers of Mapudungun at UFRO in Temuco, Chile, our data collection efforts ultimately resulted in three separate corpora: 1) a small parallel Mapudungun-Spanish corpus of historical texts and newspaper text; 2) 1,700 Spanish sentences that were manually translated into Mapudungun and aligned; these 1,700 Spanish sentences were carefully crafted to elicit specific syntactic phenomena in Mapudungun; and 3) a parallel corpus consisting of 170 hours of transcribed and translated Mapudungun speech. Natural language translation is challenging even for humans, particularly when the source text to translate is spoken language, with its many disfluencies and colloquial expressions. The translation of the speech corpus was completed by bilingual speakers of Mapudungun and Spanish who attempted to translate as literally as possible, but who were not professional translators. Furthermore, each section of the speech corpus was only ever viewed by a single translator. All of these translation challenges combine to render the Spanish translation of the Mapudungun text imperfect. Still, a bilingual corpus is a valuable resource. Additional details on these corpora are described in Monson et al. (2004).

While small, the second corpus listed above, the structural elicitation corpus, has aided our development of the hand-built rule-based MT system for Mapudungun (see section 2222). The larger spoken corpus was put to two specific uses in developing MT for Mapudungun. First, the transcribed and translated spoken corpus was used as a parallel corpus to develop an Example-Based MT system, described further in section 2221. Second, the AVENUE team used the spoken corpus to derive a bilingual stem lexicon for use in rule-based MT systems.

To obtain the bilingual stem lexicon, the AVENUE team first converted the transcribed spoken corpus text into a monolingual Mapudungun word list. The 117,003 most frequent fully-inflected Mapudungun word forms were hand-checked for spelling according to the adopted orthographic conventions for Mapudungun, and 15,120 of the most frequent spelling-corrected fully-inflected word forms were hand-segmented into two parts: a stem and one or more suffixes. This produced 5,234 stems and 1,303 suffix groups. These 5,234 stems were then translated into Spanish.
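The first step of this pipeline, extracting a frequency-ranked word list from the transcribed corpus, is straightforward; a sketch follows (Python; the file name and tokenization are illustrative placeholders, not our actual scripts):

# Sketch: derive a frequency-ranked word list from a transcribed corpus.
from collections import Counter

counts = Counter()
with open("mapudungun_transcripts.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.lower().split())

# Most frequent fully-inflected forms first; in our pipeline the top of
# this list was hand-checked for spelling and then hand-segmented into
# stem + suffixes.
for word, freq in counts.most_common(15120):
    print(f"{freq}\t{word}")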

2211 Mapudungun morphological analyzer

In addition to consuming data resources, the MT systems used in the AVENUE project consume the computational resource of morphological analysis. Mapudungun is an agglutinative and polysynthetic language. Figure 2 contains glosses of a few morphologically complex Mapudungun verbs that occur in our corpus of spoken Mapudungun; Monson et al. (2004) document the large number of occurring Mapudungun word forms in this speech corpus.


A typical complex verb form in our spoken corpus consists of five or six morphemes. Agglutinative morphological processes explode the number of possible surface word forms while simultaneously causing individual surface word forms to be rare. Furthermore, each morpheme may alone contain several morphosyntactic features. Hence it is difficult for an MT system to translate Mapudungun words directly into another language. Analyzing the morphology of Mapudungun allows our MT systems to cover a larger fraction of all Mapudungun surface word forms. And by identifying each morpheme in each Mapudungun word and assigning a meaning to each identified morpheme, a rule-based MT system can translate each piece of meaning individually.

We had found few data resources available for Mapudungun, so we were not surprised to learn that no morphological analyzer for Mapudungun had ever been written; we therefore developed our own. Using the stem lexicon we had created and reference grammars of Mapudungun, our morphological analyzer takes a Mapudungun word as input and produces as output all possible morphological analyses of the word. Each analysis specifies:

1) a single stem in that word,

2) each suffix in that word, and

3) a syntactic analysis for the stem and each identified suffix.

To identify the morphemes (stem and suffixes) in a Mapudungun word, a lexicon of stems works together with a fairly complete lexicon of Mapudungun suffixes. The current version of the stem lexicon contains XXX stems and their Spanish translations. Each entry in this lexicon lists the part of speech of the stem, as well as other features associated with the stem, such as lexical aspect in the case of verb stems. The suffix lexicon, built by hand by computational linguists on the AVENUE team, contains all the uncontroversial inflectional morphemes of Mapudungun. There are XXX Mapudungun suffixes in the suffix lexicon. Each suffix lists the part of speech that the suffix attaches to, as well as the linguistic features, such as person, number, mood, etc., that it marks. The morphological analyzer performs a recursive and exhaustive search over all possible segmentations of a given Mapudungun word. The analyzer starts from the beginning of the word and identifies and removes each stem that is an initial string of that word. (The current morphological analyzer does not handle noun incorporation, a fairly rare phenomenon in Mapudungun.) The analyzer then examines the remaining string, looking for a valid combination of suffixes that could complete the word. A suffix combination is valid if 1) each of the suffixes in the combination can attach to stems with the part of speech of the current stem, and 2) the suffixes occur in a grammatical order.

Mapudungun morphology is relatively unambiguous. Most Mapudungun words whose stem is in the stem lexicon receive a single analysis; a few truly ambiguous suffix combinations may cause a Mapudungun word to receive more than one distinct analysis.

Once the morphological analyzer has found all possible segmentations of a word, it combines the feature information from the stem and the suffixes encountered in the analyzed word to create a syntactic analysis, which is returned. For an example, see Figure 3.
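The segmentation search itself can be sketched as follows (Python, with tiny invented lexicons; the actual analyzer works over the full stem and suffix lexicons, enforces suffix-ordering constraints, and returns feature structures rather than strings):

# Illustrative sketch of the recursive stem + suffix segmentation search.
STEMS = {"pe": "v"}                  # stem -> part of speech (invented entries)
SUFFIXES = {"fi": "v", "n": "v"}     # suffix -> POS it attaches to

def suffix_splits(rest, pos):
    """Enumerate ways to parse `rest` as suffixes attaching to POS `pos`."""
    if rest == "":
        return [[]]
    analyses = []
    for i in range(1, len(rest) + 1):
        suffix = rest[:i]
        if SUFFIXES.get(suffix) == pos:          # valid attachment for this POS
            for tail in suffix_splits(rest[i:], pos):
                analyses.append([suffix] + tail)
    return analyses                              # ordering constraints omitted here

def analyze(word):
    """Return all (stem, suffixes) segmentations of `word`."""
    results = []
    for i in range(1, len(word) + 1):
        stem = word[:i]
        if stem in STEMS:                        # stem must be an initial string
            for sufs in suffix_splits(word[i:], STEMS[stem]):
                results.append((stem, sufs))
    return results

print(analyze("pefin"))   # [('pe', ['fi', 'n'])]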

222 Two Machine Translation Systems

With the basic data and computational resources built for Mapudungun, the AVENUE team could proceed toward constructing machine translation systems from Mapudungun into Spanish. We have developed two prototype Mapudungun-Spanish MT systems. The first follows a data-driven paradigm called Example-Based MT; an EBMT system can exploit the decent-sized spoken corpus of Mapudungun as a source of data to produce translations with higher lexical coverage but lower translation accuracy. The second is a hand-built Rule-Based MT system; building an MT system by hand is feasible only because the AVENUE team has computational linguists available as a resource. The human-developed Rule-Based MT system complements the EBMT system with high-quality translations of specific syntactic constructions over a smaller vocabulary.

2221 An Example-Based Machine Translation System

Example-Based Machine Translation (EBMT) relies on previous translations performed by humans to create new translations without the need for human translators. The previous translations are called the training corpus. For the best translation quality, the training corpus should be as large as possible and as similar to the text to be translated as possible. When the exact sentence to be translated occurs in the training material, the translation quality is human-level, because the previous translation is re-used. As the source sentence differs more and more from the training material, quality decreases, because smaller and smaller fragments must be combined to produce the translation, increasing the chances of an incorrect translation.

As the amount of training material decreases, so does the translation quality, because there are fewer long matches between the training texts and the input to be translated. Conversely, more training data can be added at any time, improving the system's performance by allowing more and longer matches. EBMT usually finds only partial matches, which generate lower-quality translations. Further, EBMT searches for phrases of two or more words, and thus there can be portions of the input which do not produce phrasal translations. For unmatched portions of the input, EBMT falls back on a probabilistic lexicon trained from the corpus to produce word-for-word translations. This fall-back approach provides some translation (even if of lower quality) for any word form encountered in the training data. While the translation quality of an EBMT system can be human-level when long phrases or entire sentences are matched, any mistakes in the human translations used for training (spelling errors, omissions, mistranslations) will become visible in the EBMT system's output. Thus it is important that the training data be as accurate as possible. The training corpus we use for EBMT is the spoken-language corpus described earlier; as discussed in section 221, the corpus of spoken Mapudungun contains some errors and awkward translations.
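The core matching strategy can be sketched as follows (a toy illustration with invented fragments; the real EBMT system's indexing, alignment and scoring are substantially more sophisticated):

# Toy illustration of EBMT matching with a word-for-word fallback.
# PHRASE_TABLE holds fragments remembered from the training corpus;
# WORD_LEXICON holds the most probable single-word translations.
PHRASE_TABLE = {("I", "eat"): "como", ("the", "bread"): "el pan"}
WORD_LEXICON = {"I": "yo", "eat": "como", "the": "el", "bread": "pan", "fresh": "fresco"}

def translate(tokens):
    output, i = [], 0
    while i < len(tokens):
        # Prefer the longest phrasal match starting at position i ...
        for j in range(len(tokens), i, -1):
            phrase = tuple(tokens[i:j])
            if phrase in PHRASE_TABLE:
                output.append(PHRASE_TABLE[phrase])
                i = j
                break
        else:
            # ... otherwise fall back to the word-for-word lexicon.
            output.append(WORD_LEXICON.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(output)

print(translate("I eat the fresh bread".split()))  # como el fresco pan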


Highly agglutinative languages like Mapudungun pose a challenge for Example-Based MT. Because there are so many inflected versions of each stem, most inflected words are rare. If a rare word does not occur in the corpus at all, it will not be translatable by EBMT; if it occurs only a few times, it is also hard for EBMT to gather accurate statistics about how it is used. Additionally, automatic word-level alignment between the two languages, which is used to determine the appropriate translation of a partial match, becomes more difficult when individual words in one language correspond to many words in the other language. We address both issues by using the morphological analyzer for Mapudungun to split words into individual stems and suffixes. Alone, each stem and suffix is more common than the original combination of stem and suffixes, and the separate morphemes are more likely to map to single words in Spanish.

We currently have a prototype EBMT system trained on approximately 204,000 sentence pairs from the corpus of spoken Mapudungun, containing a total of 1.25 million Mapudungun tokens after splitting the original words into stems and one or more suffixes. This separation increased BLEU scores (Papineni et al., 2002) on a held-out portion of the speech corpus by 5.48%, from 0.1530 to 0.1614. We expect further improvements in MT accuracy from more detailed use of morphological analysis, from the inclusion of common phrases in the corpus, and from repairing translation errors and awkward translations in the corpus.

2222 A Rule-Based MT system

The second MT system the AVENUE team has developed is a Rule-Based system where the rules have been built by hand. Manually developing a Rule-Based MT (RBMT) system involves detailed comparative analysis of the grammars of the source and target languages. Manually imbuing an MT system with detailed syntactic knowledge can produce high-quality translation, but typically takes a significant amount of time to implement. Despite the time requirements, developing a Rule-Based MT system for Mapudungun by hand is a natural fit in our case, for two reasons. First, the AVENUE project has the necessary human expertise available to hand-develop an MT system with significant syntactic understanding. And second, hand-built rule-based MT nicely complements the weaknesses of EBMT: EBMT focuses on high translation coverage at the expense of accuracy; in contrast, the rule-based MT system does not exhaustively cover Mapudungun grammar, but the rules we wrote cover a portion of Mapudungun with high accuracy.

At a high level, the AVENUE Rule-Based MT framework is composed of three modules: a source-language morphological analyzer, the syntactic rule-based transfer system itself, and a target-language morphological generator. The source-language morphological analyzer processes the raw input source sentence into a lattice of all possible morphological segmentations. The transfer system maps the morphologically analyzed source sentences into target-language expressions via a transfer lexicon and grammar. The transfer grammar and lexicon contain syntactic and lexical rules that explicitly refer to morphosyntactic features of the two languages (more details are given in the next subsections). The output of the transfer system is a target-language expression composed of uninflected words bundled with grammatical features. It is then the morphological generator's job to massage the uninflected word-feature bundles into fully inflected target-language output sentences.

In the Mapudungun-Spanish Rule-Based MT system, the source-language morphological analyzer is the Mapudungun morphological analyzer described earlier in section 2211; the Mapudungun-to-Spanish transfer grammar and lexicon are described in section 2224; and the Spanish morphological generator is described in section 2225. But before delving into the details of the Mapudungun-Spanish rule-based MT system, a short overview of the AVENUE rule-based MT framework is needed.

2223 Overview of Transfer Rule Syntax and the Run-time Transfer System

This section provides a basic understanding of syntactic transfer rules and their interpretation within Rule-Based MT in the AVENUE framework; the rationale behind the transfer rules and the run-time transfer engine that interprets them is described in more detail in Probst et al. (2002) and Peterson (2002).

Figure 4: A simplified hand-written AVENUE-style transfer rule. This rule transfers Mapudungun nominal expressions that mark for number into Spanish nominal expressions. Where Spanish marks number morphologically on the head noun, Mapudungun uses a separate particle to mark number on nouns.

{NBar,1}
NBar::NBar : [PART N] -> [N]
(
 (X2::Y1)
 ((X1 number) =c pl)
 ((X0 number) = (X1 number))
 ((Y0 number) = (Y1 number))
 ((Y0 gender) = (Y1 gender))
 ((Y0 number) = (X0 number))
)

1) Rule identifier
2) SL and TL constituent-structure specification
3) SL (X2) to TL (Y1) constituent alignment
4) SL constraints: this rule applies only if SL number is pl; SL number is passed up to the SL NBar
5) TL constraints: TL number is passed up to the TL NBar; TL gender is passed up to the TL NBar
6) Transfer equation: the number on the TL NBar must equal the number on the SL NBar



22231 The AVENUE Transfer Rule Formalism

In Rule-Based MT, transfer rules decompose the grammatical information contained in a source-language expression into a set of grammatical properties, such as number, person, tense, lexical meaning, etc. Each particular rule then builds an equivalent target-language expression, copying, modifying or rearranging grammatical values according to the requirements of the target language's syntax and lexical repertoire. In the AVENUE system, translation rules have six components (see Figure 4): 1) a rule identifier, which consists of a constituent type (Sentence, Noun Phrase, Verb Phrase, etc.) and a unique ID number; 2) a constituent-structure specification for both the source language (SL) and the target language (TL); 3) alignments between the SL constituents and the TL constituents; 4) SL constraints, which provide information about features and their values in the SL sentence; 5) TL constraints, which provide information about features and their values in the TL sentence; and 6) transfer equations, which provide information about which feature values transfer from the source into the target language.

Figure 4 illustrates these six components using a portion of a rule intended for a Mapudungun-to-Spanish rule-based transfer system. In Mapudungun, plurality in nouns is marked in some cases by the prenominal particle pu. The transfer rule in Figure 4 transfers the plural feature from pu in Mapudungun onto Spanish nouns. In this transfer grammar, NBar is the constituent that dominates the noun and its modifiers, but not its determiners. According to this rule, the Mapudungun sequence PART N will be transferred into a noun in Spanish. The constituent alignment aligns the Mapudungun noun (X2) with the Spanish noun (Y1). The SL side is checked to ensure the application of the {NBar,1} rule only when the Mapudungun particle specifies plural number; if the Mapudungun noun is preceded by any other particle, the rule fails to apply. The number feature is passed up from the particle to the Mapudungun NBar. At the same time, the number that is marked on the Spanish noun is passed up to the Spanish NBar. The gender feature, present only in Spanish, is passed up from the Spanish noun to the Spanish NBar. And finally, a transfer equation ensures that the number on the Spanish NBar equals the number on the Mapudungun NBar. This process is represented graphically by the tree structure shown in Figure 5.
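To make the constraint semantics concrete, here is a small sketch (Python; feature structures reduced to plain dictionaries, which glosses over the unification machinery of the actual engine) of how the constraints of the {NBar,1} rule could be evaluated: =c checks that a value is already present, while = assigns or copies values.

# Toy feature structures for one application of the {NBar,1} rule.
# X1 = Mapudungun particle, X2 = Mapudungun noun -> Y1 = Spanish noun.
x1 = {"number": "pl"}                 # the particle "pu"
y1 = {"number": "pl", "gender": "f"}  # the chosen Spanish noun
x0, y0 = {}, {}                       # the SL and TL NBar nodes being built

def check(fs, feat, value):           # the =c constraint: value must already match
    return fs.get(feat) == value

if check(x1, "number", "pl"):         # ((X1 number) =c pl)
    x0["number"] = x1["number"]       # ((X0 number) = (X1 number))
    y0["number"] = y1["number"]       # ((Y0 number) = (Y1 number))
    y0["gender"] = y1["gender"]       # ((Y0 gender) = (Y1 gender))
    assert y0["number"] == x0["number"]   # ((Y0 number) = (X0 number)) must hold

print(x0, y0)   # {'number': 'pl'} {'number': 'pl', 'gender': 'f'}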

We make use of a new framework (Probst et al., 2002) for rapid prototyping of MT systems for languages with limited amounts of electronically available linguistic resources and corpora, which includes a declarative formalism for symbolic transfer grammars. A grammar consists of a collection of transfer rules, which specify how phrase structures in a source language correspond and transfer to phrase structures in a target language, and the constraints under which these rules should apply. The framework also includes a fully-implemented transfer engine that applies the transfer grammar to a source-language input sentence at runtime and produces all possible word- and phrase-level translations according to the grammar. This framework was designed to support research on methods for automatically acquiring transfer grammars from limited amounts of elicited word-aligned data. The framework also supports manual development of transfer grammars by experts familiar with the two languages. While the framework itself is still research work in progress, it is sufficiently well developed for serious experimentation.


22232 The Runtime Transfer Engine

At run time, the transfer engine interprets individual transfer rules like that in Figure 4 to convert a morphologically analyzed source-language sentence into a target-language expression. The output of the run-time system is a lattice of translation alternatives that have yet to be morphologically inflected in the target language. The transfer engine produces a lattice of alternatives, as opposed to a single target-language expression, because there may be syntactic and lexical ambiguity in the transfer rules and lexicon: multiple translation choices for source-language lexical items in the dictionary and multiple competing transfer rules can each increase the number of possible translations (see the next section).

The run-time translation system incorporates into a single module all three of the traditional steps of syntactic transfer-based MT: 1) syntactic parsing of the source-language input; 2) transfer of the parsed constituents of the source language to their corresponding structured constituents on the target-language side; and 3) generation of the target-language expressions from the target-language constituents. All three of these processes are performed based on the transfer grammar, the comprehensive set of transfer rules that are loaded into the run-time system.

In the first stage, source-language (SL) parsing is performed based solely on the SL side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in Allen (1995). A chart is populated with all the constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage: a target-language (TL) chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules provide individual lexical choices for the TL word-level entries in the TL chart. The set of generated TL output expressions that corresponds to the collection of all TL chart entries is then collected into a TL lattice. Finally, the TL lattice is passed to a statistical decoder, which ranks the possible paths through the lattice of translation alternatives. A more detailed description of the runtime transfer-based translation sub-system can be found in Peterson (2002).

The methodology behind the system thus combines two separate modules: a transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to a target-language model. This general framework has been under development by the MT group at Carnegie Mellon under the AVENUE project (Probst et al., 2002) and was previously used for rapid prototyping of an MT system for Hindi-to-English translation (Lavie et al., 2003).


22233 Decoding

In the final stage, a decoder is used to select a single target-language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input, and these translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the decoder is to select a linear sequence of adjoining, but non-overlapping, translation units that maximizes the probability of the target-language string given the source-language string. The decoder is designed to use a probability model that calculates this probability as the product of two factors: a translation model for the translation units and a language model for the target language. The translation model is normally based on a probabilistic version of the translation lexicon, in which probabilities are associated with the different possible translations for each source-language word. Such a probability model can be acquired automatically, using a reasonable-size parallel corpus to train a word-to-word probability model based on automatic word-alignment algorithms. This was not possible for Hebrew-to-English (section 24), since we do not have a suitable parallel corpus at our disposal; the decoder for the Hebrew-to-English system therefore uses the English language model as the sole source of information for selecting among alternative translation segments.


We use an English trigram language model trained on 160 million words.

The decoding search algorithm considers all possible sequences in the lattice and calculates the language-model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of reordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
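The selection step can be sketched as a dynamic program over the lattice. The following toy illustration uses a unigram "language model" and invented probabilities in place of the real trigram LM and translation model, and omits reordering:

import math

# Lattice entries: (start, end, translation, translation-model probability).
lattice = [
    (0, 2, "estoy bien", 0.6),
    (0, 1, "estoy", 0.5),
    (1, 2, "bien", 0.7),
]
LM = {"estoy": 0.02, "bien": 0.03}   # toy unigram LM probabilities

def lm_logprob(words):
    return sum(math.log(LM.get(w, 1e-6)) for w in words.split())

def decode(lattice, length):
    # best[i] = (score, translation) for the best covering of input positions 0..i
    best = {0: (0.0, "")}
    for end in range(1, length + 1):
        candidates = []
        for (s, e, words, tm_prob) in lattice:
            if e == end and s in best:
                prev_score, prev_words = best[s]
                score = prev_score + math.log(tm_prob) + lm_logprob(words)
                candidates.append((score, (prev_words + " " + words).strip()))
        if candidates:
            best[end] = max(candidates)
    return best[length][1]

print(decode(lattice, 2))   # "estoy bien"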

2224 Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the Mapudungun-Spanish manually written rule-based MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, we did not neglect the available computational resources for Mapudungun: we developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus. Similarly, a version of the elicitation corpus translated into Mapudungun served as a development suite to guide us on which syntactic transfer rules were most vital to write.

The next three sub-sections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages; manually writing transfer rules forced us to consciously overcome them. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. And the third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor of the kinds of transfer rules that are needed to write a transfer grammar from Mapudungun to Spanish, two completely unrelated languages. For a more in-depth discussion of the transfer grammar and of translation mismatches between Mapudungun and Spanish, please see Aranovich (forthcoming).

22241 Suffix Agglutination

As discussed in section 21, Mapudungun is an agglutinative language, with often four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled {VSuffG,1} in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second VSuffG rule, labeled {VSuffG,2} in Figure 6, recursively combines an additional VSuff with a VSuffG, passing the feature structures of both children up to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language, Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to, and generation of, Spanish expressions.

These two VSuffG rules apply successfully to combine the two verbal suffixes -fi and -ñ of example (1) into a single VSuffG. First, rule {VSuffG,1} applies to -fi to produce a VSuffG marked with third-person object agreement. Next, rule {VSuffG,2} recursively combines -ñ with the VSuffG containing -fi, yielding a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verbal Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1) pe -fi -ñ
    see -3obj -1sgsubjindic
    'I saw him/her/them/it'

{VSuffG,1}
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

{VSuffG,2}
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
 (X0 = X2))

Figure 6: Two recursive Verbal Suffix Group rules.


22242 Tense

Tense in Mapudungun is often morphologically unmarked. The temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun; consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past-tense feature when necessary (transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.

22243 Typological divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are grammaticalized using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation, but the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see example (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as well as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle, as in example (3).

(2) Mapudungun: kofke -tu -n
    bread -verbalizer -1sgsubjindic
    Spanish: com-o pan
    eat -1sgpresindic bread
    English: 'I eat bread'

(3) Mapudungun: pe -nge -n
    see -passive -1sgsubjindic
    Spanish: fui visto
    Aux.1sgpastindic see.passPart
    English: 'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side. The rule for passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary the correct grammatical features, and inflect the verb as a passive participle; Figure 9 shows a simplified version of the rule that produces this result.

Tense     Aspect     Lexical Aspect  Temporal Interpretation  Example
Future    Any        Any             Future                   pe -a -n (see -future -1sgsubjindic) 'I will see'
Unmarked  Any        Stative         Present                  niye -n (own -1sgsubjindic) 'I own'
Unmarked  Habitual   Any             Present                  kellu -ke -n (help -habitual -1sgsubjindic) 'I help'
Unmarked  Unmarked   Unmarked        Past                     kellu -n (help -1sgsubjindic) 'I helped'

Figure 7: Abstract rules for the interpretation of tense in Mapudungun. Tense and aspect are grammatical features marked by the verbal suffixes; lexical aspect is a property of the verb stem.

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
(
 (X1::Y1)                          ; alignment
 ((X2 tense) = UNDEFINED)          ; SL constraint on morphological tense
 ((X1 lexicalaspect) = UNDEFINED)  ; SL constraint on the verb's aspectual class
 ((X2 aspect) = (NOT habitual))    ; SL constraint on grammatical aspect
 ((X0 tense) = past)               ; SL tense feature assignment
 ...
)

Figure 8: Past tense rule (transfer of features omitted).


{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]    ; insertion of aux on the Spanish side
(
 (X1::Y2)                           ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
 ((X2 voice) =c passive)            ; Mapudungun voice constraint
 ((Y1 person) = (Y0 person))        ; passing person features to aux in Spanish
 ((Y1 number) = (Y0 number))        ; passing number features to aux in Spanish
 ((Y1 mood) = (Y0 mood))            ; passing mood features to aux in Spanish
 ((Y2 number) =c (Y1 number))       ; Spanish-side agreement constraint
 ((Y1 tense) = past)                ; assigning tense feature to aux in Spanish
 ((Y1 form) =c ser)                 ; Spanish auxiliary selection
 ((Y2 mood) = part)                 ; Spanish-side verb form constraint
 ...
)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, the non-Indo-European language Mapudungun marks passive voice with a verbal suffix. Hence an auxiliary verb must be inserted on the Spanish side; the auxiliary verb is the first V constituent on the production side of the transfer rule. The rule then ensures that the person, number, mood and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.

2225 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon: with a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona under a research license. Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that the inflected form marks (see http://www.lsi.upc.es/~nlp/freeling/parole-es.html). We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute-value pairs of the format that our MT system expects. This way, when the run-time transfer engine needs a particular inflected form of a Spanish word, the transfer engine can pass the uninflected form to the Spanish morphology generator and have it generate the appropriate surface inflected forms.
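The generation step then reduces to a lookup from a citation form plus a feature bundle to a surface form; a minimal sketch follows (the entries and feature names are invented stand-ins for the UPC dictionary's actual format):

# Toy generation table keyed on (citation form, feature bundle).
# The real module is derived from the full UPC inflected dictionary.
GENERATION = {
    ("cantar", ("1", "sg", "pres", "ind")): "canto",
    ("cantar", ("3", "sg", "past", "ind")): "cantó",
    ("casa",   ("f", "pl")):               "casas",
}

def generate(citation_form, features):
    """Map an uninflected stem plus a feature bundle to a surface form."""
    return GENERATION.get((citation_form, tuple(features)), citation_form)

print(generate("cantar", ["3", "sg", "past", "ind"]))   # cantó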

When the Spanish morphological generation module is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN))))))>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA)))))))>
tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO)))))))>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA)) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN))))))>

223 Evaluation

[To be completed: human evaluation; BLEU, NIST and METEOR scores; evaluation of the EBMT system on the same test data; statistical significance tests of the results; automatic evaluation of a word-to-word translator.]

23 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established a collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals of building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing information from non-expert bilingual speakers (Font Llitjós et al., 2005a).


For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker and were later manually revised. Taking the Elicitation Corpus (section 231) as a reference, we developed a small translation grammar manually.

231 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough for the development of MT; it resembles a fieldwork questionnaire containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features, such as number, person, tense and gender; the version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992): out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books which had parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

232 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (section 2225).

2321 Morphology and Translation Lexicons

In order to build translation and morphology lexicons, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books; only 3,002 word types occurred in more than one book.1 Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

1 This was done before the OCR correction was completed, and thus this list contained OCR errors.

The 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native speaker was asked to provide information for all the fields listed in Figure 10.

1. Word segmentation
2. Root translation
3. Root part-of-speech
4. Word translation
5. Word part-of-speech
6. Translation of the final root

Figure 10: Fields required to automatically build the translation and morphology lexicons.

The native speaker was asked to provide the translation of the final root (the last field in Figure 10) only if there had been a part-of-speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua, this is important for words such as "machuyani" (I age / get older), where the root "machu" is an adjective meaning "old", and the word is a verb whose root really means "to get old" ("machuyay").2 Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently in Spanish; for these cases, the native speaker was asked to use || instead of |, and the post-processing scripts were designed to check for the consistency of || in both the translation and the POS fields.

2 -ya- is a verbalizer in Quechua


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word: chayqa
Segmentation: chay+qa
Root translation: ese | esa | eso
Root POS: Pron | Adj
Word translation: ese || es ese
Word POS: Pron || VP

Figure 11: Example of a word segmented and translated.

V::V |: [ni] -> [decir]
((X1::Y1))

N::N |: [pacha] -> [tiempo]
((X1::Y1))

N::N |: [pacha] -> [tierra]
((X1::Y1))

Pron::Pron |: [noqa] -> [yo]
((X1::Y1))

Interj::Interj |: [alli] -> [a pesar]
((X1::Y1))

Adj::Adj |: [hatun] -> [grande]
((X1::Y1))

Adj::Adj |: [hatun] -> [alto]
((X1::Y1))

Adv::Adv |: [kunan] -> [ahora]
((X1::Y1))

Adv::Adv |: [allin] -> [bien]
((X1::Y1))

Adv::Adv |: [ama] -> [no]
((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list.

; dicen que on the Spanish side
Suff::Suff |: [s] -> []
((X1::Y1)
((x0 type) = reportative))

; when following a consonant
Suff::Suff |: [si] -> []
((X1::Y1)
((x0 type) = reportative))

Suff::Suff |: [qa] -> []
((X1::Y1)
((x0 type) = emph))

Suff::Suff |: [chu] -> []
((X1::Y1)
((x0 type) = interr))

VSuff::VSuff |: [nki] -> []
((X1::Y1)
((x0 person) = 2)
((x0 number) = sg)
((x0 mood) = ind)
((x0 tense) = pres)
((x0 inflected) = +))

NSuff::NSuff |: [kuna] -> []
((X1::Y1)
((x0 number) = pl))

NSuff::Prep |: [manta] -> [de]
((X1::Y1)
((x0 form) = manta))

Figure 13: Manually written suffix lexical entries.

For the current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuamán's grammar (2001) lists a total of 150 suffixes.
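To illustrate the kind of post-processing involved, the following sketch (Python; a simplified recasting of our scripts, whose actual input format and output entry syntax differ in detail) expands one annotated row into stem lexical entries and applies the || consistency check described above:

# Expand one annotated row into lexical entries, honoring the | and ||
# conventions: | separates alternatives within one POS reading, || separates
# translations that belong to different POS tags.
def expand(word, root_translations, root_pos):
    translations = [t.strip() for t in root_translations.split("||")]
    pos_tags = [p.strip() for p in root_pos.split("||")]
    if len(translations) != len(pos_tags):        # consistency check on ||
        raise ValueError("mismatched || in translation and POS fields")
    entries = []
    for trans_group, pos_group in zip(translations, pos_tags):
        for trans in trans_group.split("|"):
            for pos in pos_group.split("|"):
                p = pos.strip()
                entries.append(f"{p}::{p} |: [{word}] -> [{trans.strip()}]")
    return entries

for e in expand("chay", "ese | esa | eso", "Pron | Adj"):
    print(e)
# Pron::Pron |: [chay] -> [ese]  ... six entries in total, as in the text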

2322 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in subsection 22231 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the morphological generator discussed in section 2225.

233 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated nine machine translations and corrected them when necessary. The correction of MT output is the first step towards automatic rule refinement, and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the representation of Quechua stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.

{S,2}
S::S : [NP VP] -> [NP VP]
(
 (X1::Y1)
 (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender))
)

{SBar,1}
SBar::SBar : [S] -> [Dice que S]
(
 (X1::Y2)
 ((x1 type) =c reportative)
)

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
(
 (X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood))
)

Figure 14: Manually written grammar rules for Quechua-Spanish translation.

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO)))))>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ)))))))>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO)))) (VP3 (VBAR2 (V35 SOY)) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA)))))))>

Figure 15: MT output from the Quechua-Spanish Rule-Based MT prototype.

24 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior; in particular, its lexicon, word formation and inflectional morphology are typically Semitic. The major word-formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in its early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual; hence the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and the morphological levels. The Hebrew script3, not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"), prepositions such as B ("in"), K ("as"), L ("to") and M ("from"), subordinating conjunctions such as $ ("that") and K$ ("when"), relativizers such as $ ("that"), and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, of the root, or of a prefixing particle.

3 To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization; Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.
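The resulting segmentation ambiguity can be made concrete with a small sketch (Python; the toy stem list is invented, and the real analyzer consults a full lexicon and morphotactic constraints):

# Enumerate readings of a Hebrew form as optional prefix particles + stem,
# using the ASCII transliteration from the text.
PARTICLES = ["H", "B", "K", "L", "M", "$", "K$", "W"]
STEMS = {"MHGR": "immigrant", "HGR": "Hagar", "GR": "foreigner"}

def readings(form, prefix=()):
    found = []
    if form in STEMS:
        found.append((prefix, form, STEMS[form]))
    for p in PARTICLES:
        if form.startswith(p) and len(form) > len(p):
            found.extend(readings(form[len(p):], prefix + (p,)))
    return found

for prefix, stem, gloss in readings("MHGR"):
    print("-".join(prefix + (stem,)), "=>", gloss)
# MHGR   => immigrant
# M-HGR  => Hagar      ("from Hagar")
# M-H-GR => foreigner  ("from the foreigner")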

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule-based, relying on the AVENUE run-time transfer engine described in section 2223 to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English; this small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems we used existing publicly available resources, which we adapted in novel ways for the MT task to directly address the major issues of lexical, morphological and orthographic ambiguity. We have evaluated translation performance for both rule-based systems using state-of-the-art measures, and the translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphological analysis and target language morphological generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation


segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows Encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
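The shape of this pre-processing step can be sketched as follows (in Python). The input is assumed to have already been decoded from its source encoding into Unicode upstream; the mapping table shown is a common romanization scheme consistent with the examples in this paper, not necessarily the exact table our module uses.

# Minimal sketch of the romanization step (illustrative table).
ROMAN = dict(zip(
    [chr(c) for c in range(0x05D0, 0x05EB)],   # alef .. tav
    "ABGDHWZX@IKKLMMNNS&PPCCQR$T"))            # $ = shin, @ = tet, & = ayin

def romanize(text):
    # Map Hebrew letters to ASCII; digits and punctuation pass through.
    return "".join(ROMAN.get(ch, ch) for ch in text)

print(romanize("\u05D1\u05E9\u05D5\u05E8\u05D4"))   # prints B$WRH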

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the Hebrew analyzer is available to us only as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure, which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic H ("her") as a suffix. Such a form would yield four different sequences of lexeme tokens, which are all stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2 Lattice Representation of a set of Analyses for the Hebrew Word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3 Feature-Structure Representation of a set of Analyses for the Hebrew Word B$WRH
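The conversion from analyzer output to this lattice can be sketched as follows. Each analysis is a sequence of (lexeme, feature-structure) pairs; arcs shared by several analyses are stored once. The helper below is our own illustration: split positions are assigned left to right rather than taken from the analyzer, so the exact span indices differ from Figure 3, and the unknown-word arc is omitted.

def build_lattice(analyses, word_len):
    arcs = {}   # (spanstart, spanend, lexeme) -> feature structure
    for segmentation in analyses:
        n = len(segmentation)
        bounds = [0] + list(range(1, n)) + [word_len]
        for i, (lexeme, feats) in enumerate(segmentation):
            # setdefault merges arcs shared between analyses
            arcs.setdefault((bounds[i], bounds[i + 1], lexeme), feats)
    return arcs

noun = {"POS": "N", "GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"}
analyses = [
    [("B$WRH", dict(noun))],                                    # "gospel"
    [("B", {"POS": "PREP"}), ("$WRH", dict(noun))],             # "in line"
    [("B", {"POS": "PREP"}), ("H", {"POS": "DET"}),
     ("$WRH", dict(noun))],                                     # "in the line"
    [("B", {"POS": "PREP"}),
     ("$WR", {"POS": "N", "GEN": "M", "NUM": "S", "STATUS": "ABSOLUTE"}),
     ("$LH", {"POS": "POSS"})],                                 # "in her bull"
]
for (start, end, lex), feats in sorted(build_lattice(analyses, 4).items()):
    print(start, end, lex, feats)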

While the analyzer of Segal (Segal 1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine


personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.
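The coverage-extension step amounts to a simple merge of the two dictionary sections, with the English-Hebrew side inverted. A minimal sketch, with illustrative data rather than the actual scripts:

def build_translation_pairs(heb2eng, eng2heb):
    # union of the Hebrew->English section and the inverted
    # English->Hebrew section
    pairs = {(h, e) for h, es in heb2eng.items() for e in es}
    pairs |= {(h, e) for e, hs in eng2heb.items() for h in hs}
    return sorted(pairs)

print(build_translation_pairs({"$MLH": ["dress"]}, {"gown": ["$MLH"]}))
# [('$MLH', 'dress'), ('$MLH', 'gown')]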

2.8 English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these forms should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
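A sketch of the enhancement pass for nouns is shown below (verbs are handled analogously, with tense and form constraints). The entry layout, the naive spelling-rule pluralizer, and the constraint notation are all simplified for illustration.

def pluralize(noun):
    # naive spelling-rule pluralization, enough for illustration
    return noun + "es" if noun.endswith(("s", "sh", "ch", "x")) else noun + "s"

def enhance(entries):
    out = []
    for heb, eng, pos, constraints in entries:
        out.append((heb, eng, pos, constraints))
        if pos == "N":
            # same Hebrew base form, English plural, plus a number
            # constraint so that only plural analyses unify with it
            out.append((heb, pluralize(eng), pos, constraints + ["(num = pl)"]))
    return out

print(enhance([("$MLH", "dress", "N", [])]))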

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
SL: $MLH ADWMH
TL: A RED DRESS
Score: 2
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
SL: H $MLWT H ADWMWT
TL: THE RED DRESSES
Score: 4
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))
Figure 4 NP Transfer Rules for Nouns Modified by Adjectives from Hebrew to English

The transfer engine was designed to support both manually developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al. 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages, and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Figure 5 Select Translated Sentences from the Development Data


System            BLEU                      NIST                      Precision   Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830      0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938      0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085      0.4241

Table 1 System Performance Results with the Various Grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al. 2002) and NIST (Doddington 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
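For reference, the bootstrapping procedure amounts to resampling the test set with replacement and reading the interval off the percentiles of the resampled corpus scores. In the sketch below, corpus_score stands in for any of the metrics in Table 1:

import random

def bootstrap_interval(segments, corpus_score, trials=1000, alpha=0.05):
    # resample the test set with replacement, rescore, take percentiles
    scores = sorted(
        corpus_score([random.choice(segments) for _ in segments])
        for _ in range(trials))
    return (scores[int(trials * alpha / 2)],
            scores[int(trials * (1 - alpha / 2)) - 1])

# toy usage: 95% interval for a mean score over 62 segments
print(bootstrap_interval(list(range(62)), lambda s: sum(s) / len(s)))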


2.11 Hindi

As of 2003, little enough natural language processing work had been done on Hindi that DARPA could designate it as the language for which to build MT systems, in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than with the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.


Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,


and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than with the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1. Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish), for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

-Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows the SL and TL type information to be different.

-Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the `right-hand sides' of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

-Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

-X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

PASSIVE, SIMPLE PRESENT
VP::VP [V V Aux] -> [Aux being V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

-Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

-XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally "sent going present-tense-auxiliary") in a sentence such as Ab-tak patr DAk se bheje jAte h~ai ("Letters still are being sent by mail", literally "up-to-now letters mail by sent going present-tense-auxiliary"). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial `guesses' at transfer rules. The rules that result from Seed Generation are `flat' in that they specify a sequence of parts of speech, and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be re-written more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only

for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
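The compositionality stage can be illustrated with a small sketch: whenever the right-hand side of a previously learned NP rule matches a subsequence of a flat rule, the subsequence is folded into the NP non-terminal. Alignments and constraints, which the real learner also updates, are omitted here.

def compose(flat_rule, np_rhs_patterns):
    seq = list(flat_rule)
    for pat in sorted(np_rhs_patterns, key=len, reverse=True):
        n = len(pat)
        i = 0
        while i + n <= len(seq):
            if seq[i:i + n] == list(pat):
                seq[i:i + n] = ["NP"]      # fold matched span into NP
            else:
                i += 1
    return seq

# A flat PP rule "Det Adj N P" generalizes to "NP P", given learned
# NP right-hand sides (cf. the Det N P example in the text).
print(compose(["Det", "Adj", "N", "P"], [("Det", "N"), ("Det", "Adj", "N")]))
# ['NP', 'P']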

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The run-time translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar, the comprehensive set of transfer rules that are loaded into the run-time system. In the first stage, parsing is performed based solely on the x side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up chart parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the run-time transfer-based translation sub-system can be found in [Peterson 2002].
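The transfer-and-generation step can be sketched for a single chart entry as follows. The rule and chart structures are simplified (literal elements and feature constraints are omitted), and the data reuses the Hebrew adjective-noun example of Figure 4.

def transfer_entry(entry, rules, lattice):
    # entry: (type, span, children); each child is a (sl_word, tl_string)
    # pair, in SL order, already translated at the word level
    ctype, span, children = entry
    for rule in rules:
        if rule["type"] == ctype and len(rule["align"]) == len(children):
            tl = [None] * len(children)
            for x_pos, y_pos in rule["align"]:
                tl[y_pos] = children[x_pos][1]   # reorder per alignment
            lattice.append((span, " ".join(tl)))

rules = [{"type": "NP1", "align": [(0, 1), (1, 0)]}]   # noun adj -> adj noun
lattice = []
transfer_entry(("NP1", (0, 2), [("$MLH", "dress"), ("ADWMH", "red")]),
               rules, lattice)
print(lattice)   # [((0, 2), 'red dress')]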

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining, but non-overlapping, translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_{s} p(\tilde{f} \mid \tilde{e}) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i})    (4)

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence, i.e., leaving a gap, translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution, i.e., longer movements are less likely than shorter ones, with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search, i.e., partial translation hypotheses which are low scoring compared to the best scoring hypothesis up to that point are pruned from further consideration.
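Equations (2)-(4) translate directly into code. The sketch below scores one candidate sequence of translation units with toy probability tables; beam search and reordering are omitted.

from math import prod

def unit_prob(f_words, e_words, lex):
    # Equation (3): product over source words of summed lexicon scores
    return prod(sum(lex.get((f, e), 1e-9) for e in e_words)
                for f in f_words)

def lm_prob(e_words, trigram):
    # Equation (2): trigram language model, with start-of-sentence padding
    padded = ["<s>", "<s>"] + e_words
    return prod(trigram(tuple(padded[i - 2:i + 1]))
                for i in range(2, len(padded)))

def path_prob(units, lex, trigram):
    # Equations (1) and (4): LM probability times per-unit translation
    # probabilities, for one non-overlapping sequence of units
    e_words = [w for _, e in units for w in e]
    return lm_prob(e_words, trigram) * prod(unit_prob(f, e, lex)
                                            for f, e in units)

# toy values: one unit translating Hindi "ghar" as "house"
lex = {("ghar", "house"): 0.8}
print(path_prob([(["ghar"], ["house"])], lex, lambda tg: 0.1))  # 0.1 * 0.8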

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences as a linguistic field worker would do. We use it to insure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our

trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural, even for native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translations.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description            Morpher          Xfer
past participle        (tam = yA)       (aspect = perf) (form = part)
present participle     (tam = wA)       (aspect = imperf) (form = part)
infinitive             (tam = nA)       (form = inf)
future                 (tam = future)   (tense = fut)
subjunctive            (tam = subj)     (tense = subj)
root                   (tam = 0)        (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the run-time Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha ("continue, stay, keep"). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
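The Table I mapping is essentially a lookup table. A sketch of the conversion wrapper (the surrounding code and names are ours; the feature pairs are those of Table I):

TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},   # past participle
    "wA":     {"aspect": "imperf", "form": "part"},   # present participle
    "nA":     {"form": "inf"},                        # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def convert(morpher_feats):
    # Map Morpher output features into the Xfer feature system.
    out = dict(morpher_feats)
    tam = out.pop("tam", None)
    out.update(TAM_TO_XFER.get(tam, {}))
    return out

print(convert({"tam": "yA", "gen": "m"}))
# {'gen': 'm', 'aspect': 'perf', 'form': 'part'}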

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA ("go", to indicate completion), lenA ("take", to indicate an action done for one's own benefit), or denA ("give", to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/mood combinations.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2)
 (X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2)
 (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion (jIvana (life) ke (of) eka (one) aXyAya (chapter)):
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion (one chapter of life):
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)
 (X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003]. The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system. Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were "best bets" for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we `enhanced' the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense entries. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add `es' instead of a simple `s' for plural, etc. For irregular verb inflections, we consulted a list of

English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

-72 manually written phrase transfer rules: A bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

-105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond to English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

-48 manually written time expression rules: A bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details, see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2. Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
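The three passes can be sketched as follows. The helper data is toy, the morphological analyzer is a stand-in, and grammar-rule application in the second pass is omitted; duplicate arcs are harmless, since the decoder rescores all entries.

def produce_lattice(sentence, phrases, lexicon, morph):
    lattice = []   # entries: ((start, end), english)
    # Pass 1: full inflected forms against phrasal entries; no rules.
    for i in range(len(sentence)):
        for j in range(i + 2, len(sentence) + 1):   # >1 word = phrasal
            phrase = " ".join(sentence[i:j])
            if phrase in phrases:
                lattice.append(((i, j), phrases[phrase]))
    # Pass 2: root forms from morphology feed the lexicon (and, in the
    # real system, the transfer grammar).
    for i, word in enumerate(sentence):
        root, feats = morph(word)
        if root in lexicon:
            lattice.append(((i, i + 1), lexicon[root]))
    # Pass 3: original inflected forms, matched word-to-word; no rules.
    for i, word in enumerate(sentence):
        if word in lexicon:
            lattice.append(((i, i + 1), lexicon[word]))
    return lattice

morph = lambda w: (w.split("+")[0], {})   # toy analyzer: root+suffix
print(produce_lattice(["kitAb+eM", "meM"],
                      {"kitAb+eM meM": "in books"},
                      {"kitAb": "book", "meM": "in"},
                      morph))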

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group, as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a test sentence and reference translations.

System                  BLEU                NIST
EBMT                    0.058               4.22
SMT                     0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer+SMT)         0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II. System performance results for the various translation approaches.

[Line graph omitted: NIST score (4.2 to 6.0) as a function of the decoder reordering window (0 to 4) for MEMT (SMT+Xfer), Xfer-ManGram, Xfer-LearnGram, Xfer-NoGram, and SMT.]

Fig. 9. Results by NIST score.

[Line graph omitted: BLEU score (0.07 to 0.15) as a function of the decoder reordering window (0 to 4) for the same five systems.]

Fig. 10. Results by BLEU score.

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding: at any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and are allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-Based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the resulting sets of scores.
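This resampling procedure (a standard bootstrap) can be sketched as follows; corpus_score is a placeholder for the BLEU or NIST scorer:

import random
import statistics

def bootstrap_scores(test_set, corpus_score, n_draws=10000, seed=0):
    # test_set: list of (hypothesis, references) pairs, here 258 of them.
    # corpus_score: corpus-level metric (BLEU or NIST); placeholder here.
    rng = random.Random(seed)
    n = len(test_set)
    scores = []
    for _ in range(n_draws):
        # Draw n sentences with replacement from the original test set.
        resample = [test_set[rng.randrange(n)] for _ in range(n)]
        scores.append(corpus_score(resample))
    scores.sort()
    median = statistics.median(scores)
    # 95% confidence interval from the 2.5th and 97.5th percentiles.
    return median, (scores[int(0.025 * n_draws)], scores[int(0.975 * n_draws)])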

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden of reordering" can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16(2), 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19(2), 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28(1), 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language resource. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. MS thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3 Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon and improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool, to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the Hindi, Hebrew, Mapudungun, and Quechua systems contrasted in this paper, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese, other languages native to Brazil, and Aymara.

4 Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People who participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5 References

Allen, James. (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramatica Quechua. Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a transfer engine for rapid machine translation development. MS thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD Thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjos, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX workshop on Machine Translation for Semitic Languages, New Orleans, September.


consists of five or six morphemes. Agglutinative morphological processes explode the number of possible surface word forms, while simultaneously causing individual surface word forms to be rare. Furthermore, each morpheme may alone contain several morphosyntactic features. Hence, it is difficult for an MT system to translate Mapudungun words directly into another language. Analyzing the morphology of Mapudungun allows our MT systems to cover a larger fraction of all Mapudungun surface word forms. And by identifying each morpheme in each Mapudungun word and assigning a meaning to each identified morpheme, a rule-based MT system can translate each piece of meaning individually.

We had found few data resources available for Mapudungun, so we were not surprised to learn that no morphological analyzer for Mapudungun had ever been written. Necessarily, then, we developed our own morphological analyzer. Using the stem lexicon we had created and reference grammars of Mapudungun, our morphological analyzer takes a Mapudungun word as input and produces as output all possible morphological analyses of the word. Each analysis specifies:

- A single stem in that word
- Each suffix in that word
- A syntactic analysis for the stem and each identified suffix

To identify the morphemes (stem and suffixes) in a Mapudungun word, a lexicon of stems works together with a fairly complete lexicon of Mapudungun suffixes. The current version of the stem lexicon contains XXX stems and their Spanish translations. Each entry in this lexicon lists the part of speech of the stem, as well as other features associated with the stem, such as lexical aspect in the case of verb stems. The suffix lexicon, built by hand by computational linguists on the AVENUE team, contains all the uncontroversial inflectional morphemes of Mapudungun. There are XXX Mapudungun suffixes in the suffix lexicon. Each suffix lists the part of speech that the suffix attaches to, as well as the linguistic features, such as person, number, mood, etc., that it marks. The morphological analyzer performs a recursive and exhaustive search over all possible segmentations of a given Mapudungun word. The analyzer starts from the beginning of the word and identifies and removes each stem that is an initial string in that word. (The current morphological analyzer does not handle noun incorporation, a fairly rare phenomenon in Mapudungun.) The analyzer then examines the remaining string, looking for a valid combination of suffixes that could complete the word. A suffix combination is valid if:

- Each of the suffixes in the combination can attach to stems with the part of speech of the current stem, and
- The suffixes occur in a grammatical order.

Mapudungun morphology is relatively unambiguous. Most Mapudungun words for which the stem is in the stem lexicon receive a single analysis. A few truly ambiguous suffix combinations may cause a Mapudungun word to receive more than one distinct analysis.

Once the morphological analyzer has found all possible segmentations of a word, it combines the feature information from the stem and the suffixes encountered in the analyzed word to create a syntactic analysis, which is returned. For an example, see Figure 3.
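A minimal sketch of this segmentation search, with invented toy lexicon entries (the real analyzer also enforces suffix-ordering constraints and returns full feature structures rather than bare glosses):

# Sketch of the exhaustive segmentation search (toy lexicons; the real
# analyzer also enforces suffix ordering and returns feature structures).
STEMS = {"pe": {"pos": "v", "gloss": "see"}}                 # invented entries
SUFFIXES = {"fi": {"attaches_to": "v", "feats": "3.obj"},
            "n": {"attaches_to": "v", "feats": "1.sg.subj.indic"}}

def analyze(word):
    analyses = []
    for stem, entry in STEMS.items():
        if word.startswith(stem):        # a stem must be an initial string
            for suffixes in suffix_splits(word[len(stem):], entry["pos"]):
                analyses.append((stem, suffixes))
    return analyses

def suffix_splits(rest, pos):
    """Recursively enumerate valid suffix sequences covering `rest`."""
    if not rest:
        yield []
        return
    for suffix, entry in SUFFIXES.items():
        if rest.startswith(suffix) and entry["attaches_to"] == pos:
            for tail in suffix_splits(rest[len(suffix):], pos):
                yield [suffix] + tail

print(analyze("pefin"))   # [('pe', ['fi', 'n'])]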

2.2.2 Two Machine Translation Systems

With the basic data and computational resources built for Mapudungun, the AVENUE team could proceed toward constructing machine translation systems from Mapudungun into Spanish. We have developed two prototype Mapudungun-Spanish MT systems. The first is a data-driven paradigm called Example-Based MT: an EBMT system can exploit the decent-sized spoken corpus of Mapudungun as a source of data, producing translations with higher lexical coverage but lower translation accuracy. The second is a hand-built Rule-Based MT system. Building an MT system by hand is feasible only because the AVENUE team has computational linguists available as a resource. The human-developed Rule-Based MT system complements the EBMT system with high-quality translations of specific syntactic constructions over a smaller vocabulary.

2.2.2.1 An Example-Based Machine Translation System

Example-Based Machine Translation (EBMT) relies on previous translations performed by humans to create new translations without the need for human translators. The previous translations are called the training corpus. For the best translation quality, the training corpus should be as large as possible and as similar to the text to be translated as possible. When the exact sentence to be translated occurs in the training material, the translation quality is human-level, because the previous translation is re-used. As the source sentence differs more and more from the training material, quality decreases, because smaller and smaller fragments must be combined to produce the translation, increasing the chances of an incorrect translation.

As the amount of training material decreases, so does the translation quality, because there are fewer long matches between the training texts and the input to be translated. Conversely, more training data can be added at any time, improving the system's performance by allowing more and longer matches. EBMT usually finds only partial matches, which generate lower-quality translations. Further, EBMT searches for phrases of two or more words, and thus there can be portions of the input which do not produce phrasal translations. For unmatched portions of the input, EBMT falls back on a probabilistic lexicon trained from the corpus to produce word-for-word translations. This fall-back approach provides some translation (even though of lower quality) for any word form encountered in the training data. While the translation quality of an EBMT system can be human-level when long phrases or entire sentences are matched, any mistakes in the human translations used for training (spelling errors, omissions, mistranslations) will become visible in the EBMT system's output. Thus, it is important that the training data be as accurate as possible. The training corpus we use for EBMT is the spoken language corpus described earlier. As discussed above, the corpus of spoken Mapudungun contains some errors and awkward translations.


Highly agglutinative languages like Mapudungun pose a challenge for Example-Based MT. Because there are so many inflected versions of each stem, most inflected words are rare. If the rare words do not occur in the corpus at all, they will not be translatable by EBMT; if they occur only a few times, it will also be hard for EBMT to gather accurate statistics about how they are used. Additionally, automatic word-level alignment between the two languages, which is used to determine the appropriate translation of a partial match, becomes more difficult when individual words in one language correspond to many words in the other language. We address both issues by using the morphological analyzer for Mapudungun to split words into individual stems and suffixes. Alone, each stem and suffix is more common than the original combination of stem and suffixes, and the separate morphemes are more likely to map to single words in Spanish.

We currently have a prototype EBMT system trained on approximately 204,000 sentence pairs from the corpus of spoken Mapudungun, containing a total of 1.25 million Mapudungun tokens after splitting the original words into stems and one or more suffixes. This separation increased BLEU scores (Papineni et al., 2001) on a held-out portion of the speech corpus by 5.48%, from 0.1530 to 0.1614. We expect further improvements in MT accuracy from more detailed use of morphological analysis, from the inclusion of common phrases in the corpus, and from repairing translation errors and awkward translations in the corpus.
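The preprocessing amounts to rewriting each word as a stem followed by separate suffix tokens before EBMT training and word alignment; a sketch, assuming an analyze function like the one above and simply taking its first analysis:

def split_line(line, analyze):
    # Replace each Mapudungun word by "stem -suffix -suffix ..." tokens;
    # taking the first analysis is a simplification for illustration.
    tokens = []
    for word in line.split():
        analyses = analyze(word)
        if analyses:
            stem, suffixes = analyses[0]
            tokens.append(stem)
            tokens.extend("-" + s for s in suffixes)
        else:
            tokens.append(word)   # unanalyzable words pass through unchanged
    return " ".join(tokens)

# e.g. "pefin" becomes "pe -fi -n", so each morpheme can align to a single
# Spanish word during EBMT training.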

2.2.2.2 A Rule-Based MT System

The second MT system the AVENUE team has developed is a Rule-Based system, where the rules have been built by hand. Manually developing a Rule-Based MT (RBMT) system involves detailed comparative analysis of the grammar of the source and target languages. Manually imbuing an MT system with detailed syntactic knowledge can produce high-quality translation, but typically takes a significant amount of time to implement. Despite the time requirements, developing a Rule-Based MT system for Mapudungun by hand is a natural fit in our case, for two reasons. First, the AVENUE project has the necessary human expertise available to hand-develop an MT system with significant syntactic understanding. And second, hand-built rule-based MT nicely complements the weaknesses of EBMT: EBMT focuses on high translation coverage at the expense of accuracy; in contrast, the rule-based MT system does not exhaustively cover Mapudungun grammar, but the rules we wrote cover a portion of Mapudungun with high accuracy.

At a high level, the AVENUE Rule-Based MT framework is composed of three modules: a source language morphological analyzer, the syntactic rule-based transfer system itself, and a target language morphological generator. The source language morphological analyzer processes the raw input source sentence into a lattice of all possible morphological segmentations. The transfer system maps the morphologically analyzed source sentences into target language expressions, via a transfer lexicon and grammar. The transfer grammar and lexicon contain syntactic and lexical rules that explicitly refer to morphosyntactic features of the two languages (more details are given in the next subsections). The output of the transfer system is a target language expression composed of uninflected words bundled with grammatical features. It is then the morphological generator's job to massage the uninflected word-feature bundles into fully inflected target language output sentences.
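Schematically, the three modules compose into a single translation function; the module names in this sketch are placeholders for the components described in the following sections:

def translate(sentence, analyze_sl, transfer, generate_tl):
    # 1) analyze: raw source sentence -> lattice of morphological analyses
    analysis_lattice = analyze_sl(sentence)
    # 2) transfer: analyzed source -> uninflected TL words + feature bundles
    tl_lattice = transfer(analysis_lattice)
    # 3) generate: feature bundles -> fully inflected TL output sentences
    return generate_tl(tl_lattice)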

In the Mapudungun-Spanish Rule-Based MT system, the source language morphological analyzer is the Mapudungun morphological analyzer described earlier. The Mapudungun-to-Spanish transfer grammar and lexicon, and the Spanish morphological generator, are described in the sections below. But before delving into the details of the Mapudungun-Spanish rule-based MT system, a short overview of the AVENUE rule-based MT framework is needed.

2.2.2.3 Overview of Transfer Rule Syntax and the Run-time Transfer System

This paper does not describe in detail the rationale behind either the transfer rules used in the AVENUE Rule-Based MT system or the run-time MT transfer engine used to interpret them.

Figure: A simplified hand-written AVENUE-style transfer rule. This rule transfers Mapudungun nominal expressions that mark for number into Spanish nominal expressions. Where Spanish marks number morphologically on the head noun, Mapudungun uses a separate particle to mark number on nouns.

{NBar,1}                        ; 1) rule identifier
NBar::NBar : [PART N] -> [N]    ; 2) SL and TL constituent structure specification
(
 (X2::Y1)                       ; 3) SL (X2) to TL (Y1) constituent alignment
 ((X1 number) =c pl)            ; 4) SL constraint: this rule applies only if SL number is pl
 ((X0 number) = (X1 number))    ; 4) SL constraint: SL number is passed up to the SL NBar
 ((Y0 number) = (Y1 number))    ; 5) TL constraint: TL number is passed up to the TL NBar
 ((Y0 gender) = (Y1 gender))    ; 5) TL constraint: TL gender is passed up to the TL NBar
 ((Y0 number) = (X0 number))    ; 6) transfer equation: the number on the TL NBar must equal the number on the SL NBar
)


This section, however, does provide a basic understanding of syntactic rules and their interpretation within Rule-Based MT in the AVENUE framework. For more specifics on the AVENUE Rule-Based MT formalism and run-time transfer engine, please see Peterson (2002) and Probst (2005).

2.2.2.3.1 The AVENUE Transfer Rule Formalism

In Rule-Based MT, transfer rules decompose the grammatical information contained in a source language expression into a set of grammatical properties, such as number, person, tense, lexical meaning, etc. Each rule then builds an equivalent target language expression, copying, modifying, or rearranging grammatical values according to the requirements of the target language syntax and target language lexical repertoire. In the AVENUE system, translation rules have six components (see the figure above): 1) a rule identifier, which consists of a constituent type (Sentence, Noun Phrase, Verb Phrase, etc.) and a unique ID number; 2) a constituent structure specification for both the source language (SL) and the target language (TL); 3) alignments between the SL constituents and the TL constituents; 4) SL constraints, which provide information about features and their values in the SL sentence; 5) TL constraints, which provide information about features and their values in the TL sentence; and 6) transfer equations, which provide information about which feature values transfer from the source into the target language.

The figure above illustrates these six components using a portion of a rule intended for a Mapudungun-to-Spanish rule-based transfer system. In Mapudungun, plurality in nouns is marked in some cases by the pronominal particle pu. The transfer rule in the figure transfers the plural feature from pu in Mapudungun onto Spanish nouns. In this transfer grammar, NBar is the constituent that dominates the noun and its modifiers, but not its determiners. According to this rule, the Mapudungun sequence PART N will be transferred into a noun in Spanish. The constituent alignment aligns the Mapudungun noun (X2) with the Spanish noun (Y1). The SL is checked to ensure the application of the NBar,1 rule only when the Mapudungun particle specifies plural number; if the Mapudungun noun is preceded by any other particle, the rule fails to apply. The number feature is passed up from the particle to the Mapudungun NBar. At the same time, the number that is marked on the Spanish noun is passed up to the Spanish NBar. The gender feature, present only in Spanish, is passed up from the Spanish noun to the Spanish NBar. And finally, a transfer equation ensures that the number on the Spanish NBar equals the number on the Mapudungun NBar. This process can also be represented graphically as a tree structure.
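The constraint checking and feature passing performed by this rule can be sketched with feature structures as dictionaries; here "=c" constraints only test a value and "=" equations copy one, a simplification of the engine's full unification:

# Sketch of applying rule NBar,1 with feature structures as dicts
# ("=c" tests a value, "=" copies one; real unification is richer).
def apply_nbar1(part_feats, tl_noun_feats):
    if part_feats.get("number") != "pl":          # ((X1 number) =c pl)
        return None                               # rule fails to apply
    x0 = {"number": part_feats["number"]}         # ((X0 number) = (X1 number))
    y0 = {"gender": tl_noun_feats.get("gender")}  # ((Y0 gender) = (Y1 gender))
    y0["number"] = x0["number"]                   # ((Y0 number) = (X0 number))
    return y0                                     # features of the Spanish NBar

# "pu" (plural particle) + noun -> Spanish NBar marked plural:
print(apply_nbar1({"number": "pl"}, {"gender": "f"}))
# {'gender': 'f', 'number': 'pl'}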

We make use of a new framework (Probst et al., 2002) for rapid prototyping of MT systems for languages with limited amounts of electronically available linguistic resources and corpora, which includes a declarative formalism for symbolic transfer grammars. A grammar consists of a collection of transfer rules, which specify how phrase structures in a source language correspond and transfer to phrase structures in a target language, and the constraints under which these rules should apply. The framework also includes a fully implemented transfer engine that applies the transfer grammar to a source-language input sentence at runtime and produces all possible word and phrase-level translations according to the grammar. This framework was designed to support research on methods for automatically acquiring transfer grammars from limited amounts of elicited word-aligned data. The framework also supports manual development of transfer grammars by experts familiar with the two languages. While the framework itself is still research work in progress, it is sufficiently well developed for serious experimentation.


2.2.2.3.2 The Runtime Transfer Engine

At run time, the transfer engine interprets individual transfer rules, like the one in the figure above, to convert a morphologically analyzed source language sentence into a target language expression. The output of the run-time system is a lattice of translation alternatives that have yet to be morphologically inflected in the target language. The transfer engine produces a lattice of alternatives, as opposed to a single target language expression, because there may be syntactic and lexical ambiguity in the transfer rules and lexicon. Multiple translation choices for source language lexical items in the dictionary and multiple competing transfer rules can each increase the number of possible translations (see the next section).

The run-time translation system incorporates into a single module all three of the traditional steps of syntactic transfer-based MT: 1) syntactic parsing of the source language input; 2) transfer of the parsed constituents of the source language to their corresponding structured constituents on the target language side; and 3) generation of the target language expressions from the target language constituents. All three of these processes are performed based on the transfer grammar, the comprehensive set of transfer rules that are loaded into the run-time system.

In the first stage, source language (SL) parsing is performed based solely on the SL side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in Allen (1995). A chart is populated with all the constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A target language (TL) chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart determine the corresponding constituent structure on the TL side; at the word level, lexical transfer rules provide individual lexical choices for the TL word-level entries in the TL chart. The set of generated TL output expressions that corresponds to the collection of all TL chart entries is then collected into a TL lattice. Finally, the TL lattice is passed to a statistical decoder, which ranks the possible paths through the lattice of translation alternatives. A more detailed description of the runtime transfer-based translation sub-system can be found in Peterson (2002).

The methodology behind the system combines two separate modules: a transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to an English language model. This general framework has been under development by the MT group at Carnegie Mellon under the AVENUE project (Probst et al., 2002) and was previously used for rapid prototyping of an MT system for Hindi-to-English translation (Lavie et al., 2003).

The transfer engine is the module responsible for applying the comprehensive set of lexical and structural transfer rules specified by the translation lexicon and the transfer grammar (respectively) to the source-language (SL) input lattice, producing a comprehensive collection of target-language (TL) output segments. The output of the transfer engine is a lattice of alternative translation segments. The alternatives arise from syntactic ambiguity, lexical ambiguity, and multiple synonymous choices for lexical items in the translation lexicon. The engine integrates parsing, transfer, and generation as described above; a more detailed description of the transfer engine can be found in Peterson (2002).

3.5 Decoding

In the final stage, a decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input, and these translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of the target language string given the source language string. The decoder is designed to use a probability model that calculates this probability as a product of two factors: a translation model for the translation units and a language model for the target language. The translation model is normally based on a probabilistic version of the translation lexicon, in which probabilities are associated with the different possible translations for each source-language word. Such a probability model can be acquired automatically, using a reasonable-size parallel corpus to train a word-to-word probability model based on automatic word alignment algorithms. This was not currently possible for Hebrew-to-English, since we do not have a suitable parallel corpus at our disposal. The decoder for the Hebrew-to-English system therefore uses the English language model as the sole source of information for selecting among alternative translation segments. We use an English trigram language model trained on 160 million words.

The decoding search algorithm considers all possible sequences in the lattice and calculates the language model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of reordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
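The core of this search, without the reordering, can be sketched as follows; the lattice format and the constant trigram probability are toy stand-ins for the real data structures and trained language model:

import math

def lm_logprob(words, trigram_prob):
    # Trigram language model score of a complete candidate translation.
    padded = ["<s>", "<s>"] + list(words)
    return sum(math.log(trigram_prob(tuple(padded[i:i + 3])))
               for i in range(len(words)))

def paths(lattice, pos, end):
    # Enumerate sequences of adjoining, non-overlapping translation units
    # covering the input from position pos to end.
    if pos == end:
        yield []
        return
    for start, stop, words in lattice:
        if start == pos:
            for tail in paths(lattice, stop, end):
                yield words + tail

def decode(lattice, end, trigram_prob):
    return max(paths(lattice, 0, end),
               key=lambda ws: lm_logprob(ws, trigram_prob),
               default=None)

# Toy lattice over a 2-word input: a phrasal unit competes with two
# word-level units; the language model scores each complete sequence.
lattice = [(0, 1, ["I"]), (1, 2, ["saw"]), (0, 2, ["I", "see"])]
print(decode(lattice, 2, lambda trigram: 0.1))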

2.2.2.4 Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the Mapudungun-Spanish manually written rule-based MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, we did not neglect the available computational resources for Mapudungun: we developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus. Similarly, a version of the elicitation corpus translated into Mapudungun served as a development suite, guiding us on which syntactic transfer rules were most vital to write.

The next three subsections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages; manually writing transfer rules forced us to consciously overcome them. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. And the third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor of the kind of transfer rules that are needed to write a transfer grammar from Mapudungun to Spanish, two completely unrelated languages. For a more in-depth discussion of the transfer grammar and translation mismatches between Mapudungun and Spanish, please see Aranovich (forthcoming).

2.2.2.4.1 Suffix Agglutination

As discussed above, Mapudungun is an agglutinative language, with often four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled VSuffG,1 in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second VSuffG rule, labeled VSuffG,2 in Figure 6, recursively combines an additional VSuff with a VSuffG, passing up the feature structures of both children to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language, Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to, and generation of, Spanish expressions.

These two VSuffG rules apply to combine the two verbal suffixes -fi and -ñ of example (1) into a single VSuffG. First, the rule VSuffG,1 is applied to -fi to produce a VSuffG marked with third person object agreement. Next, the rule VSuffG,2 recursively combines -ñ with the VSuffG containing -fi, yielding a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verbal Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1) pe -fi -ñ
    see -3.obj -1.sg.subj.indic
    'I saw him/her/them/it'

{VSuffG,1}
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

{VSuffG,2}
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
 (X0 = X2))

Figure 6. Two recursive Verbal Suffix Group rules.


2.2.2.4.2 Tense

Tense in Mapudungun is often morphologically unmarked: the temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun; consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.

2.2.2.4.3 Typological Divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are grammaticalized using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation, but the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see example (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as well as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle; see example (3).

(2) Mapudungun: kofke-tu          -n
                bread-verbalizer  -1sgsubjindic
    Spanish:    com-o             pan
                eat-1sgpresindic  bread
    English:    'I eat bread'

(3) Mapudungun: pe  -nge     -n
                see -passive -1sgsubjindic
    Spanish:    fui              visto
                Aux1sgpastindic  seepassPart
    English:    'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side. The rule for passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary

Figure 7 Abstract rules for the interpretation of Tense in Mapudungun

Grammatical Features   Lexical    Temporal        Example
Tense      Aspect      Aspect     Interpretation
Future     Any         Any        Future          pe -a -n (see future 1sgsubjindic) 'I will see'
Unmarked   Any         Stative    Present         niye -n (own 1sgsubjindic) 'I own'
Unmarked   Habitual    Unmarked   Present         kellu -ke -n (help habitual 1sgsubjindic) 'I help'
Unmarked   Unmarked    Unmarked   Past            kellu -n (help 1sgsubjindic) 'I helped'
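
The decision table in Figure 7 can be rendered as executable logic; the following is a minimal Python sketch, with illustrative feature names and values that are our assumptions rather than the grammar's actual symbols.

def temporal_interpretation(tense: str, aspect: str, lexical_aspect: str) -> str:
    """Return the temporal interpretation of a Mapudungun verb form (Figure 7)."""
    if tense == "future":
        return "future"            # pe -a -n      'I will see'
    # tense is morphologically unmarked from here on
    if lexical_aspect == "stative":
        return "present"           # niye -n       'I own'
    if aspect == "habitual":
        return "present"           # kellu -ke -n  'I help'
    return "past"                  # kellu -n      'I helped'

assert temporal_interpretation("unmarked", "unmarked", "activity") == "past"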

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
(
 (X1::Y1)                           ; alignment
 ((X2 tense) = UNDEFINED)           ; SL constraint on morphological tense
 ((X1 lexicalaspect) = UNDEFINED)   ; SL constraint on verb's aspectual class
 ((X2 aspect) = (NOT habitual))     ; SL constraint on grammatical aspect
 ((X0 tense) = past)                ; SL tense feature assignment
 ...
)

Figure 8 Past tense rule (transfer of features omitted)


the correct grammatical features, and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

2.2.2.5 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona, under a research license (http://www.lsi.upc.es/~nlp/freeling/parole-es.html). Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that the inflected form marks. We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute-value pairs of the format that our MT system expects. This way, when the run-time transfer engine needs a particular inflected form of a Spanish word, it can pass the uninflected form to the Spanish Morphology Generator and have it generate the appropriate surface inflected form.
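
A minimal sketch of this inversion of the inflected dictionary into a generation table follows; the feature labels and the example row are illustrative assumptions, not the actual UPC tagset or entry format.

# (citation form, (feature=value, ...)) -> surface form
GEN_TABLE: dict[tuple, str] = {}

def load_entry(citation: str, inflected: str, features: dict):
    """Index one line of the inflected dictionary for generation."""
    key = (citation, tuple(sorted(features.items())))
    GEN_TABLE[key] = inflected

def generate(citation: str, features: dict) -> str:
    key = (citation, tuple(sorted(features.items())))
    return GEN_TABLE.get(key, citation)  # fall back to the stem

load_entry("comer", "como",
           {"pos": "v", "person": 1, "number": "sg",
            "tense": "pres", "mood": "indic"})

print(generate("comer", {"pos": "v", "person": 1, "number": "sg",
                         "tense": "pres", "mood": "indic"}))  # -> "como"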

When the Spanish morphological generation module is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN) ) ) ) ) )>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA) ) ) ) ) ) )>
tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO) ) ) ) ) ) )>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA) ) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN) ) ) ) ) )>

2.2.3 Evaluation

[This section is still to be completed. Planned content: a hand evaluation; BLEU, NIST, and METEOR scores; an evaluation of the EBMT system on the same test data; statistical significance tests of the results; and an automatic evaluation of a word-to-word translator.]

2.3 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established a collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals for building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]     ; insertion of aux on the Spanish side
(
 (X1::Y2)                            ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
 ((X2 voice) =c passive)             ; Mapudungun voice constraint
 ((Y1 person) = (Y0 person))         ; passing person features to aux in Spanish
 ((Y1 number) = (Y0 number))         ; passing number features to aux in Spanish
 ((Y1 mood) = (Y0 mood))             ; passing mood features to aux in Spanish
 ((Y2 number) =c (Y1 number))        ; Spanish-side agreement constraint
 ((Y1 tense) = past)                 ; assigning tense feature to aux in Spanish
 ((Y1 form) =c ser)                  ; Spanish auxiliary selection
 ((Y2 mood) = part)                  ; Spanish-side verb form constraint
 ...
)

Figure 9 Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, the non-Indo-European language Mapudungun marks passive voice with a verbal suffix. Hence, an auxiliary verb must be inserted on the Spanish side. The auxiliary verb is the first V constituent on the production side of the transfer rule. This rule then ensures that the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.



For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker, and were later manually revised. Taking the Elicitation Corpus (Section 2.3.1) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough for the development of MT. The EC resembles a fieldwork questionnaire containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features such as number, person, tense, and gender. The version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books which had parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section 2.2.2.5).

2.3.2.1 Morphology and Translation Lexicons

In order to build a translation and morphology lexicon, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books; only 3,002 word types were in more than one book.1 Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique

1 This was done before the OCR correction was completed, and thus this list contained OCR errors.

word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

These 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native speaker was asked to provide information for all the fields listed in Figure 10.

1. Word Segmentation
2. Root translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the final root

Figure 10 Fields required to automatically build translation and morphology lexicons

The native speaker was asked to provide the translation of the final root (the last field in Figure 10) only if there had been a Part-of-Speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua, this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old", and the word is a verb whose root really means "to get old" ("machuyay").2 Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently into Spanish. For these cases, the native speaker was asked to use || instead of |, and the post-processing scripts were designed to check for the consistency of || in both the translation and the POS fields.

2 -ya- is a verbalizer in Quechua.


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word:             chayqa
Segmentation:     chay+qa
Root translation: ese | esa | eso
Root POS:         Pron | Adj
Word Translation: ese || es ese
Word POS:         Pron || VP

Figure 11 Example of a word segmented and translated
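
The cross-product expansion from a row like Figure 11 to the entries of Figure 12 can be pictured with a short script. This is a hedged reconstruction of what the post-processing scripts do; the field handling and output format are simplified.

def expand(root: str, root_translations: str, root_pos: str):
    """Cross-product of '|'-separated POS alternatives and translations."""
    translations = [t.strip() for t in root_translations.split("|")]
    pos_tags = [p.strip() for p in root_pos.split("|")]
    return [(pos, root, trans) for pos in pos_tags for trans in translations]

# chayqa: root translations "ese | esa | eso", root POS "Pron | Adj"
for pos, src, tgt in expand("chay", "ese | esa | eso", "Pron | Adj"):
    print(f"{pos} | [{src}] -> [{tgt}]")
# Prints the six entries Pron/Adj x ese/esa/eso, cf. the format of Figure 12.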

V | [ni] -> [decir]
((X1::Y1))

N | [pacha] -> [tiempo]
((X1::Y1))

N | [pacha] -> [tierra]
((X1::Y1))

Pron | [noqa] -> [yo]
((X1::Y1))

Interj | [alli] -> [a pesar]
((X1::Y1))

Adj | [hatun] -> [grande]
((X1::Y1))

Adj | [hatun] -> [alto]
((X1::Y1))

Adv | [kunan] -> [ahora]
((X1::Y1))

Adv | [allin] -> [bien]
((X1::Y1))

Adv | [ama] -> [no]
((X1::Y1))

Figure 12 Automatically generated lexical entries from the segmented and translated word list

; 'dicen que' on the Spanish side
Suff::Suff | [s] -> []
((X1::Y1)
 ((x0 type) = reportative))

; when following a consonant
Suff::Suff | [si] -> []
((X1::Y1)
 ((x0 type) = reportative))

Suff::Suff | [qa] -> []
((X1::Y1)
 ((x0 type) = emph))

Suff::Suff | [chu] -> []
((X1::Y1)
 ((x0 type) = interr))

VSuff::VSuff | [nki] -> []
((X1::Y1)
 ((x0 person) = 2)
 ((x0 number) = sg)
 ((x0 mood) = ind)
 ((x0 tense) = pres)
 ((x0 inflected) = +))

NSuff::NSuff | [kuna] -> []
((X1::Y1)
 ((x0 number) = pl))

NSuff::Prep | [manta] -> [de]
((X1::Y1)
 ((x0 form) = manta))

Figure 13 Manually written suffix lexical entries

For the current working MT prototype, the suffix lexicon has 36 entries; Cusihuamán's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in Section 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in Section 2.2.2.5.

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and corrected nine machine translations when necessary. The correction of MT output is the first step towards automatic rule refinement, and it was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.


{S,2}
S::S : [NP VP] -> [NP VP]
(
 (X1::Y1)
 (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender))
)

{SBar,1}
SBar::SBar : [S] -> [Dice que S]
(
 (X1::Y2)
 ((x1 type) =c reportative)
)

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
(
 (X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood))
)

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO) ) ) ) )>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ) ) ) ) ) ) )>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO) ) ) ) (VP3 (VBAR2 (V35 SOY) ) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA) ) ) ) ) ) )>

Figure 14 Manually written grammar rules for Quechua-Spanish translation

Figure 15 MT output from the Quechua-Spanish rule-based MT prototype

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants, and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,3 not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"), prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"), subordinating conjunctions such as $ ("that") and K$ ("when"), relativizers such as $ ("that"), and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus, a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from

3 To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.

Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or a prefixed particle.
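
The MHGR ambiguity can be reproduced with a minimal sketch of prefix-particle segmentation. The particle list comes from the text above; the stem lexicon here is a toy stand-in for the analyzer's lexicon, not an actual resource.

PARTICLES = ["B", "K", "L", "M", "H", "$", "K$", "W"]
STEMS = {"MHGR": "immigrant", "HGR": "Hagar", "GR": "foreigner"}

def segmentations(form: str, prefix=()):
    """Enumerate all particle+stem readings of a surface form."""
    results = []
    if form in STEMS:
        results.append(prefix + (form,))
    for p in PARTICLES:
        if form.startswith(p) and len(form) > len(p):
            results.extend(segmentations(form[len(p):], prefix + (p,)))
    return results

print(segmentations("MHGR"))
# -> [('MHGR',), ('M', 'HGR'), ('M', 'H', 'GR')]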

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus, the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule based, relying on the AVENUE run-time transfer engine described earlier in this paper to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. This small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing publicly available resources, which we adapted in novel ways for the MT task to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation


segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
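
A minimal sketch of this pre-processing step follows: decode the Windows Hebrew codepage and map each Hebrew letter to its ASCII symbol. The mapping below is partial and reconstructed from the transliteration examples in this paper (B$WRH, MHGR, $MLH, ...); the system's actual table may differ.

TRANSLIT = {
    "\u05D0": "A",  # alef
    "\u05D1": "B",  # bet
    "\u05D2": "G",  # gimel
    "\u05D3": "D",  # dalet
    "\u05D4": "H",  # he
    "\u05D5": "W",  # vav
    "\u05DB": "K", "\u05DA": "K",  # kaf, final kaf
    "\u05DC": "L",  # lamed
    "\u05DE": "M", "\u05DD": "M",  # mem, final mem
    "\u05E0": "N", "\u05DF": "N",  # nun, final nun
    "\u05E8": "R",  # resh
    "\u05E9": "$",  # shin
    "\u05EA": "T",  # tav
}

def romanize(hebrew_bytes: bytes) -> str:
    """Decode cp1255 (Windows Hebrew) input and romanize each letter."""
    text = hebrew_bytes.decode("cp1255")
    return "".join(TRANSLIT.get(ch, ch) for ch in text)

# e.g. romanize("בשורה".encode("cp1255")) -> "B$WRH"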

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we use a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which led us to build morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is thus readily available, it is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which are all stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2 Lattice Representation of a set of Analyses for the Hebrew Word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3 Feature-Structure Representation of a set of Analyses for the Hebrew Word B$WRH
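
The span-indexed encoding of Figure 3 lends itself to a simple path enumeration; here is a minimal sketch (positions taken from Figure 3, feature structures abbreviated; the arc for the possessive clitic is labeled with its surface form H rather than the lexeme $LH).

ARCS = [
    (0, 4, "B$WRH", {"pos": "N", "gen": "F"}),   # 'gospel'
    (0, 2, "B",     {"pos": "PREP"}),
    (2, 4, "$WRH",  {"pos": "N", "gen": "F"}),   # 'line'
    (0, 1, "B",     {"pos": "PREP"}),
    (1, 2, "H",     {"pos": "DET"}),
    (1, 3, "$WR",   {"pos": "N", "gen": "M"}),   # 'bull'
    (3, 4, "H",     {"pos": "POSS"}),            # clitic; lexeme $LH in Fig. 3
]

def paths(start: int, end: int):
    """Enumerate all lexeme sequences spanning [start, end)."""
    if start == end:
        return [[]]
    found = []
    for s, e, lex, feats in ARCS:
        if s == start:
            found.extend([(lex, feats)] + rest for rest in paths(e, end))
    return found

for reading in paths(0, 4):
    print(" ".join(lex for lex, _ in reading))
# Prints the four readings of Figure 2: B$WRH / B $WRH / B H $WRH / B $WR H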

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice, rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person


feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed from the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.
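
A minimal sketch of the coverage-extension step, combining the Hebrew-English section with the inverted English-Hebrew section; the dictionary contents shown are invented placeholders, not entries of the Dahan dictionary.

he_to_en = {"$MLH": ["dress"], "ADWM": ["red"]}
en_to_he = {"gospel": ["B$WRH"], "dress": ["$MLH"]}

combined: dict[str, set] = {he: set(ens) for he, ens in he_to_en.items()}
for en, he_words in en_to_he.items():        # invert the E->H section
    for he in he_words:
        combined.setdefault(he, set()).add(en)

print(sorted(combined["B$WRH"]))  # -> ['gospel'], recovered by inversion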

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary would improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

English generation differs from Spanish generation in our systems. Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms with English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
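
A minimal sketch of this enhancement process follows. Each Hebrew stem / English base-form pair is expanded into several constrained entries; the inflection logic is deliberately naive (regular forms only) and the feature names are illustrative assumptions.

def enhance_noun(hebrew: str, english: str):
    return [
        (hebrew, english, {"num": "sg"}),
        (hebrew, english + "s", {"num": "pl"}),   # naive regular plural
    ]

def enhance_verb(hebrew: str, english: str):
    return [
        (hebrew, english, {}),                               # base form
        (hebrew, english + "ed", {"tense": "past"}),         # naive past
        (hebrew, english + "s", {"tense": "pres", "person": 3, "num": "sg"}),
        (hebrew, "will " + english, {"tense": "fut"}),
        (hebrew, "to " + english, {"form": "inf"}),
    ]

for entry in enhance_noun("$WRH", "line"):
    print(entry)
# At run time the analyzed Hebrew features unify with these constraints,
# so only the matching English form survives.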

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}  (SL: $MLH ADWMH; TL: A RED DRESS; Score: 2)
NP1::NP1 : [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}  (SL: H $MLWT H ADWMWT; TL: THE RED DRESSES; Score: 4)
NP1::NP1 : [NP1 H ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Figure 4 NP Transfer Rules for Nouns Modified by Adjectives from Hebrew to English

The transfer engine was designed to support both manually developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results using this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team, and who is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5 Select Translated Sentences from the Development Data


System            BLEU                       NIST                       Precision   Recall
No Grammar        0.0606 [0.0599, 0.0612]    3.4176 [3.4080, 3.4272]    0.3830      0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]    3.5397 [3.5296, 3.5498]    0.3938      0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]    3.7850 [3.7733, 3.7966]    0.4085      0.4241

Table 1 System Performance Results with the Various Grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted at translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also report aggregate unigram precision and unigram recall as additional measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we applied a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and to establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores of the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
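
The confidence intervals in Table 1 come from resampling the test set; a minimal sketch of the percentile bootstrap follows. The names `segments` and `metric` are placeholders: per-sentence evaluation data and a corpus-level scoring function (e.g., BLEU or NIST) respectively.

import random

def bootstrap_ci(segments, metric, n_resamples=1000, alpha=0.05):
    """Percentile bootstrap over a test set (Efron and Tibshirani, 1986)."""
    scores = []
    for _ in range(n_resamples):
        # resample the test set with replacement, same size as the original
        resample = [random.choice(segments) for _ in segments]
        scores.append(metric(resample))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# e.g., with per-sentence (hypothesis, references) pairs and a BLEU function:
# lo, hi = bootstrap_ci(test_pairs, corpus_bleu)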


2.11 Hindi

As of 2003, little enough natural language processing work had been done for Hindi that DARPA designated it as the language for a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than that of the current automatically inferred rules. Furthermore, a multi-engine version of our system that combines the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,


and have significantly improved the state of the art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data

resources became available, it became clear that the Hindi-to-English task was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE (cf. [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]). We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data

scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1. Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation

principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules, which exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the 'right-hand sides' of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP : [V V Aux] -> [Aux being V]
(
 (X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part)
)

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent"; literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters are still being sent by mail"; literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat", in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be rewritten more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

for inputs that satisfy the same uni_cation constraints as the phrases from which the rule was learned A constraint relaxation technique known as Seeded Version Space Learning at-tempts to increase the generality of the rules by identifying uni_cation constraints that can be relaxed without introducing translation er-rors Detailed descriptions of the rule learning process can be found in [Probst et al 2003]

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].
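As a rough illustration of this integrated transfer stage, the Python sketch below (a hypothetical simplification; the rule encoding and data structures are ours, not the engine's) shows how a rule's constituent alignments reorder sub-constituents, while word-level entries are seeded from lexical transfer rules.

def transfer(entry, rules, lexicon):
    # entry:   an SL word (string) or a parsed constituent (rule_name, children)
    # rules:   {rule_name: {"y_len": ..., "align": [(x_pos, y_pos), ...]}}
    # lexicon: {sl_word: [tl_alternatives]}; only the first alternative is used
    # here, whereas the real system keeps all alternatives in the TL chart.
    if isinstance(entry, str):
        return lexicon.get(entry, [entry])[:1]
    rule_name, children = entry
    rule = rules[rule_name]
    slots = [[] for _ in range(rule["y_len"])]
    for x_pos, y_pos in rule["align"]:  # alignments drive the reordering
        slots[y_pos] = transfer(children[x_pos], rules, lexicon)
    return [word for slot in slots for word in slot]

# Toy run with a PP rule that inverts a postposition and its NP:
rules = {"PP,12": {"y_len": 2, "align": [(0, 1), (1, 0)]}}
lexicon = {"jIvana": ["life"], "ke": ["of"]}
print(transfer(("PP,12", ["jIvana", "ke"]), rules, lexicon))  # ['of', 'life']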

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining, but non-overlapping, translation units that maximizes the probability p(e|f), where f = f_1 \ldots f_J is the source sentence and e = e_1 \ldots e_I is the sequence of target language words. According to the Bayes decision rule, we have to search for

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_j \sum_i p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully covers the source sentence, the total translation model probability is then

p(f \mid e) = \prod_s p(\tilde{f}_s \mid \tilde{e}_s) = \prod_s \prod_j \sum_i p(f_{s_j} \mid e_{s_i})    (4)
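Once the IBM-1 lexicon t(f|e) has been trained, the phrase-level probability of Equation (3) is straightforward to compute. The Python sketch below illustrates it with invented probabilities; the smoothing floor for unseen word pairs is our own addition.

def phrase_prob(f_words, e_words, t):
    # Eq. (3): product over source words of sums over target words.
    prob = 1.0
    for f in f_words:
        prob *= sum(t.get((f, e), 1e-10) for e in e_words)
    return prob

# Toy IBM-1 lexicon probabilities t(f|e); the values are made up.
t = {("patr", "letters"): 0.6, ("DAk", "mail"): 0.5, ("DAk", "post"): 0.3}
print(phrase_prob(["patr", "DAk"], ["letters", "mail"], t))  # ~0.6 * 0.5 = 0.3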

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e. leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e. longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring, compared to the best scoring hypothesis up to that point, are pruned from further consideration.
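The Python fragment below sketches how one candidate sequence of adjoining translation units might be scored under this model. The unit format, the variance value, and the omission of the Gaussian normalization constant are simplifications of ours, not a description of the decoder's actual internals.

import math

def score_sequence(units, lm, variance=2.0):
    # units: list of (tl_words, unit_translation_prob, reorder_distance)
    # lm:    trigram model, lm(w, u, v) -> p(w | u, v)
    log_p = 0.0
    history = ["<s>", "<s>"]
    for words, unit_prob, distance in units:
        log_p += math.log(unit_prob)              # translation model
        log_p += -0.5 * distance ** 2 / variance  # zero-mean Gaussian reordering
        for w in words:                           # trigram language model
            log_p += math.log(lm(w, history[-2], history[-1]))
            history.append(w)
    return log_p

uniform_lm = lambda w, u, v: 0.01  # stand-in language model
print(score_sequence([(["letters"], 0.5, 0), (["are", "sent"], 0.4, 1)], uniform_lm))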

3. ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to insure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g. names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4. HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher         Xfer
past participle      (tam = yA)      (aspect = perf) (form = part)
present participle   (tam = wA)      (aspect = imperf) (form = part)
infinitive           (tam = nA)      (form = inf)
future               (tam = future)  (tense = fut)
subjunctive          (tam = subj)    (tense = subj)
root                 (tam = 0)       (form = root)
Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha ("continue", "stay", "keep"). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
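The feature mapping of Table I is easy to render in code. The sketch below is hypothetical: the dictionary keys and the shape of the Morpher output are our guesses at a plausible encoding, not the actual IIIT Morpher interface.

TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},  # past participle
    "wA":     {"aspect": "imperf", "form": "part"},  # present participle
    "nA":     {"form": "inf"},                       # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def morpher_to_xfer(analysis):
    # analysis: one Morpher reading, e.g. {"root": "rah", "tam": "wA"}
    features = dict(TAM_TO_XFER.get(analysis.get("tam", "0"), {}))
    return analysis["root"], features

print(morpher_to_xfer({"root": "rah", "tam": "wA"}))
# -> ('rah', {'aspect': 'imperf', 'form': 'part'})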

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE we experimented with both hand written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA ("go", to indicate completion), lenA ("take", to indicate an action done for one's own benefit), or denA ("give", to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))
Fig. 5. Recursive NP Rules

Hindi NP with left recursion:
(jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]
English NP with right recursion:
(one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]
Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))
Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e. how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
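The pruning step itself can be pictured in a few lines of Python; the counts and the threshold below are invented for illustration.

def prune_rules(rule_counts, threshold=0.01):
    # rule_counts: {rule_id: number of training examples producing the rule}
    total = sum(rule_counts.values())
    return {r: c / total for r, c in rule_counts.items() if c / total >= threshold}

counts = {"NP,1": 240, "NP,2": 2, "PP,1": 58}
print(prune_rules(counts))  # the low-frequency rule "NP,2" is pruned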

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words, leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense. The result is a set of lexical entries associating Hindi root forms to English inflected forms. (A sketch of this enhancement appears after the list of manually written rules below.)

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
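Returning to the dictionary enhancement described above, the sketch below gives a flavor of the spelling-rule-based pluralization and of the plural-number constraint attached to each new entry. The spelling rules shown are a small subset, the entry format is our own invention, and the Hindi transliteration in the example is illustrative only.

def pluralize(noun):
    # A few English spelling rules; irregulars are handled by lookup.
    irregular = {"child": "children", "man": "men"}
    if noun in irregular:
        return irregular[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in ("a", "e", "i", "o", "u"):
        return noun[:-1] + "ies"
    return noun + "s"

def enhance_noun_entry(hindi_root, english_root):
    # The root-form entry plus an inflected entry constrained to plural number.
    return [
        {"hi": hindi_root, "en": english_root, "constraints": {}},
        {"hi": hindi_root, "en": pluralize(english_root),
         "constraints": {"NUM": "pl"}},
    ]

print(enhance_noun_entry("ciTThI", "letter"))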

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence, in its fully inflected form, against the lexicon, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
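Schematically, the three passes can be written as below (Python; the arc format and the helper callables are hypothetical stand-ins for the engine's actual data structures).

def build_lattice(words, phrasal_lex, analyze, grammar_translate, word_lex):
    arcs = []  # each arc: (start_index, end_index, tl_string)
    # Pass 1: match full inflected forms against (phrasal) lexical entries.
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            for tl in phrasal_lex.get(" ".join(words[i:j]), []):
                arcs.append((i, j, tl))
    # Pass 2: morphological analysis feeds root forms into the grammar rules.
    roots_and_features = [analyze(w) for w in words]
    arcs.extend(grammar_translate(roots_and_features))
    # Pass 3: word-to-word matches on the original inflected forms,
    # without applying grammar rules.
    for i, w in enumerate(words):
        for tl in word_lex.get(w, []):
            arcs.append((i, i + 1, tl))
    return arcs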

5. THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                 BLEU               NIST
EBMT                   0.058              4.22
SMT                    0.102 (+/- 0.016)  4.70 (+/- 0.20)
Xfer no grammar        0.109 (+/- 0.015)  5.29 (+/- 0.19)
Xfer learned grammar   0.112 (+/- 0.016)  5.32 (+/- 0.19)
Xfer manual grammar    0.135 (+/- 0.018)  5.59 (+/- 0.20)
MEMT (Xfer + SMT)      0.136 (+/- 0.018)  5.65 (+/- 0.21)
Table II. System Performance Results for the Various Translation Approaches

[Figure: NIST score as a function of the decoder reordering window (0-4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.]
Fig. 9. Results by NIST score

[Figure: BLEU score as a function of the decoder reordering window (0-4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.]
Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e. only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-Based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, once, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
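The resampling procedure is straightforward to sketch in Python. For readability, the toy version below averages per-sentence scores, whereas BLEU and NIST are corpus-level metrics that would be recomputed on each resampled set.

import random

def bootstrap(sentence_scores, rounds=10000, seed=0):
    rng = random.Random(seed)
    n = len(sentence_scores)
    stats = sorted(sum(rng.choices(sentence_scores, k=n)) / n
                   for _ in range(rounds))
    return (stats[rounds // 2],          # median
            stats[int(0.025 * rounds)],  # lower bound of the 95% interval
            stats[int(0.975 * rounds)])  # upper bound of the 95% interval

print(bootstrap([0.10, 0.30, 0.20, 0.25, 0.15] * 52, rounds=1000))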

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6. CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translation produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the Hindi, Hebrew, Mapudungun, and Quechua systems contrasted in this paper, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James. (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramática Quechua: Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine translation systems for indigenous languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna, and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish user studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert, and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pages 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD thesis, Carnegie Mellon University.

Probst, Katharina, and Alon Lavie. (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley, and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly, and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


Highly agglutinative languages like Mapudungun pose a challenge for Example-Based MT. Because there are so many inflected versions of each stem, most inflected words are rare. If the rare words do not occur in the corpus at all, they will not be translatable by EBMT. If they occur only a few times, it will also be hard for EBMT to have accurate statistics about how they are used. Additionally, automatic word-level alignment between the two languages, which is used to determine the appropriate translation of a partial match, becomes more difficult when individual words in one language correspond to many words in the other language. We address both issues by using the morphological analyzer for Mapudungun to split words into individual stems and suffixes. Alone, each stem and suffix is more common than the original combination of stem and suffixes, and the separate morphemes are more likely to map to single words in Spanish.

We currently have a prototype EBMT system trained on approximately 204,000 sentence pairs from the corpus of spoken Mapudungun, containing a total of 1.25 million Mapudungun tokens after splitting the original words into stems and one or more suffixes. This separation increased BLEU scores (Papineni et al., 2001) on a held-out portion of the speech corpus by 5.48%, from 0.1530 to 0.1614. We expect further improvements in MT accuracy from more detailed use of morphological analysis, from the inclusion of common phrases in the corpus, and from repairing translation errors and awkward translations in the corpus.
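The splitting itself can be illustrated with a toy Python segmenter. The suffix inventory and the greedy right-to-left strategy below are invented for illustration; the actual system uses the full Mapudungun morphological analyzer described earlier.

def segment(word, suffixes):
    # Greedily strip known suffixes from the right edge of the word.
    parts = []
    while True:
        for suf in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf) + 1:
                parts.insert(0, "-" + suf)
                word = word[:-len(suf)]
                break
        else:
            return [word] + parts

SUFFIXES = ["fi", "mi", "n", "i"]  # toy inventory
print(segment("pefin", SUFFIXES))  # ['pe', '-fi', '-n']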

2.2.2.2 A Rule-Based MT system

The second MT system the AVENUE team has developed is a Rule-Based system, where the rules have been built by hand. Manually developing a Rule-Based MT (RBMT) system involves detailed comparative analysis of the grammar of the source and target languages. Manually imbuing an MT system with detailed syntactic knowledge can produce high quality translation, but typically takes a significant amount of time to implement. Despite the time requirements, developing a Rule-Based MT system for Mapudungun by hand is a natural fit in our case, for two reasons. First, the AVENUE project has the necessary human expertise available to hand-develop an MT system with significant syntactic understanding. And second, hand-built rule-based MT nicely complements the weaknesses of EBMT: EBMT focuses on high translation coverage at the expense of accuracy; in contrast, the rule-based MT system does not exhaustively cover Mapudungun grammar, but what rules we wrote cover a portion of Mapudungun with high accuracy.

At a high level, the AVENUE Rule-Based MT framework is composed of three modules: a source language morphological analyzer, the syntactic rule-based transfer system itself, and a target language morphological generator. The source language morphological analyzer processes the raw input source sentence into a lattice of all possible morphological segmentations. The transfer system maps the morphologically analyzed source sentences into target language expressions via a transfer lexicon and grammar. The transfer grammar and lexicon contain syntactic and lexical rules that explicitly refer to morphosyntactic features of the two languages (more details are given in the next subsections). The output of the transfer system is a target language expression composed of uninflected words bundled with grammatical features. It is then the morphological generator's job to massage the uninflected word-feature bundles into fully inflected target language output sentences.

In the Mapudungun-Spanish Rule-Based MT system, the source language morphological analyzer is the Mapudungun morphological analyzer described earlier in section . The Mapudungun to Spanish transfer grammar and lexicon are described in section , while the Spanish morphological generator is described in section . But before delving into the details of the Mapudungun-Spanish rule-based MT system, a short overview of the AVENUE rule-based MT framework is needed.

2.2.2.3 Overview of Transfer Rule Syntax and the Run-time Transfer System

This paper does not describe in detail the rationale behind either the transfer rules used in the AVENUE Rule-Based MT system or the run-time MT transfer engine

Figure: A simplified hand-written AVENUE-style transfer rule. This rule transfers Mapudungun nominal expressions that mark for number into Spanish nominal expressions. Where Spanish marks number morphologically on the head noun, Mapudungun uses a separate particle to mark number on nouns.

{NBar,1}                          ; 1) Rule identifier
NBar::NBar : [PART N] -> [N]      ; 2) SL and TL constituent structure specification
(
 (X2::Y1)                         ; 3) SL (X2) to TL (Y1) constituent alignment
 ((X1 number) =c pl)              ; 4) SL constraint: the rule applies only if SL number is pl
 ((X0 number) = (X1 number))      ; 4) SL constraint: SL number is passed up to the SL NBar
 ((Y0 number) = (Y1 number))      ; 5) TL constraint: TL number is passed up to the TL NBar
 ((Y0 gender) = (Y1 gender))      ; 5) TL constraint: TL gender is passed up to the TL NBar
 ((Y0 number) = (X0 number))      ; 6) Transfer equation: the number on the TL NBar must equal the number on the SL NBar
)


used to interpret them. This section, however, does provide a basic understanding of syntactic rules and their interpretation within Rule-Based MT in the AVENUE framework. For more specifics on the AVENUE Rule-Based MT formalism and run-time transfer engine, please see Peterson (2002) and Probst (2005).

2.2.2.3.1 The AVENUE Transfer Rule Formalism

In Rule-Based MT, transfer rules decompose the grammatical information contained in a source language expression into a set of grammatical properties, such as number, person, tense, lexical meaning, etc. Then each particular rule builds an equivalent target language expression, copying, modifying, or rearranging grammatical values according to the requirements of the target language syntax and target language lexical repertoire. In the AVENUE system, translation rules have six components (see Figure ): 1) a rule identifier, which consists of a constituent type (Sentence, Noun Phrase, Verb Phrase, etc.) and a unique ID number; 2) a constituent structure specification for both the source language (SL) and the target language (TL); 3) alignments between the SL constituents and the TL constituents; 4) SL constraints, which provide information about features and their values in the SL sentence; 5) TL constraints, which provide information about features and their values in the TL sentence; and 6) transfer equations, which provide information about which feature values transfer from the source into the target language.

Figure  illustrates these six components using a portion of a rule intended for a Mapudungun to Spanish rule-based transfer system. In Mapudungun, plurality in nouns is marked in some cases by the prenominal particle pu. The transfer rule in Figure  transfers the plural feature from pu in Mapudungun onto Spanish nouns. In this Transfer Grammar, NBar is the constituent that dominates the noun and its modifiers, but not its determiners. According to this rule, the Mapudungun sequence PART N will be transferred into a noun in Spanish. The Constituent Alignment aligns the Mapudungun noun (X2) with the Spanish noun (Y1). The SL constraint is checked to ensure the application of the NBar 1 rule only when the Mapudungun particle specifies plural number; if the Mapudungun noun is preceded by any other particle, the rule fails to apply. The number feature is passed up from the particle to the Mapudungun NBar. At the same time, the number that is marked on the Spanish noun is passed up to the Spanish NBar. The gender feature, present only in Spanish, is passed up from the Spanish noun to the Spanish NBar. And finally, a Transfer Equation ensures that the number on the Spanish NBar equals the number on the Mapudungun NBar. This process is represented graphically by the tree structure shown in Figure .
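The interaction of these constraints can be rendered as a toy Python check (a drastic simplification of feature unification; the feature dictionaries are invented, e.g. for pu ruka 'houses' translating to Spanish casas):

def apply_nbar_rule(particle_feats, tl_noun_feats):
    # ((X1 number) =c pl): the rule applies only to a plural particle.
    if particle_feats.get("number") != "pl":
        return None
    x0 = {"number": particle_feats["number"]}     # SL NBar inherits number
    y0 = {"number": tl_noun_feats["number"],      # TL NBar inherits number
          "gender": tl_noun_feats["gender"]}      # ... and gender (Spanish only)
    if y0["number"] != x0["number"]:              # ((Y0 number) = (X0 number))
        return None                               # unification clash
    return {"X0": x0, "Y0": y0}

print(apply_nbar_rule({"number": "pl"}, {"number": "pl", "gender": "f"}))
# -> {'X0': {'number': 'pl'}, 'Y0': {'number': 'pl', 'gender': 'f'}}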

We make use of a new framework (Probst et al., 2002) for rapid prototyping of MT systems for languages with limited amounts of electronically available linguistic resources and corpora, which includes a declarative formalism for symbolic transfer grammars. A grammar consists of a collection of transfer rules, which specify how phrase structures in a source language correspond and transfer to phrase structures in a target language, and the constraints under which these rules should apply. The framework also includes a fully implemented transfer engine that applies the transfer grammar to a source-language input sentence at runtime and produces all possible word- and phrase-level translations according to the grammar. This framework was designed to support research on methods for automatically acquiring transfer grammars from limited amounts of elicited word-aligned data. The framework also supports manual development of transfer grammars by experts familiar with the two languages. While the framework itself is still research work in progress, it is sufficiently well developed for serious experimentation.



2.2.2.3.2 The Runtime Transfer Engine

At run time, the transfer engine interprets individual transfer rules like the one in the figure above to convert a morphologically analyzed source-language sentence into a target-language expression. The output of the run-time system is a lattice of translation alternatives that have yet to be morphologically inflected in the target language. The transfer engine produces a lattice of alternatives, as opposed to a single target-language expression, because there may be syntactic and lexical ambiguity in the transfer rules and lexicon. Multiple translation choices for source-language lexical items in the dictionary and multiple competing transfer rules can each increase the number of possible translations (see the next section).

The run-time translation system incorporates into a single module all three of the traditional steps of syntactic transfer-based MT: 1) syntactic parsing of the source-language input; 2) transfer of the parsed constituents of the source language to their corresponding structured constituents on the target-language side; and 3) generation of the target-language expressions from the target-language constituents. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system.

In the first stage, source-language (SL) parsing is performed based solely on the SL side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in Allen (1995). A chart is populated with all the constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A target-language (TL) chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules provide individual lexical choices for the TL word-level entries in the TL chart. The set of generated TL output expressions that corresponds to the collection of all TL chart entries is then collected into a TL lattice. Finally, the TL lattice is passed to a statistical decoder, which ranks the possible paths through the lattice of translation alternatives. A more detailed description of the runtime transfer-based translation sub-system can be found in Peterson (2002).
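For concreteness, the following minimal Python sketch shows the shape of this collection step: every TL alternative generated for an SL chart entry is gathered into a lattice keyed by the SL span it covers. The data structures and names (ChartEntry, build_tl_lattice) are illustrative simplifications, not the actual AVENUE implementation.

import collections

class ChartEntry:
    def __init__(self, start, end, translations):
        self.start = start                 # start index in the SL input
        self.end = end                     # end index in the SL input
        self.translations = translations   # alternative TL strings for this constituent

def build_tl_lattice(sl_chart):
    """Collect every TL alternative of every SL constituent into a
    lattice keyed by the SL span the constituent covers."""
    lattice = collections.defaultdict(list)
    for entry in sl_chart:
        lattice[(entry.start, entry.end)].extend(entry.translations)
    return dict(lattice)

# Toy example: overlapping analyses of a three-word input.
chart = [ChartEntry(0, 2, ["the red dress"]),
         ChartEntry(0, 1, ["the dress", "the gown"]),
         ChartEntry(1, 2, ["red"])]
print(build_tl_lattice(chart))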

The complete translation methodology thus combines two separate modules: the transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to a target-language model. This general framework has been under development by the MT group at Carnegie Mellon under the AVENUE project (Probst et al., 2002) and was previously used for rapid prototyping of an MT system for Hindi-to-English translation (Lavie et al., 2003).


2.2.2.3.3 Decoding

In the final stage, a decoder is used in order to select a single target-language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input, and these translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of the target-language string given the source-language string. The decoder is designed to use a probability model that calculates this probability as a product of two factors: a translation model for the translation units and a language model for the target language. The translation model is normally based on a probabilistic version of the translation lexicon, in which probabilities are associated with the different possible translations for each source-language word. Such a probability model can be acquired automatically by using a reasonable-size parallel corpus to train a word-to-word probability model based on automatic word alignment algorithms. This was not possible for Hebrew-to-English, since we do not have a suitable parallel corpus at our disposal. The decoder for the Hebrew-to-English system therefore uses the English language model as the sole source of information for selecting among alternative


translation segments. We use an English trigram language model trained on 160 million words.

The decoding search algorithm considers all possible sequences in the lattice and calculates the language model probability for the resulting sequence of target words. It then selects the sequence with the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of reordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
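For concreteness, the following Python sketch implements this search over a toy lattice, using a unigram language model as a stand-in for the trigram model and omitting reordering. All names, probabilities, and data structures are illustrative, not the actual decoder.

import math

def lm_logprob(words, lm):
    """Score a word sequence with a toy unigram model (stand-in for the trigram LM)."""
    return sum(math.log(lm.get(w, 1e-6)) for w in words)

def decode(lattice, n, lm):
    """lattice: dict (start, end) -> list of TL word tuples; n: input length.
    Dynamic programming over input positions finds the best sequence of
    adjoining, non-overlapping segments covering the whole input."""
    best = {0: (0.0, [])}   # position -> (score, chosen segments so far)
    for start in range(n):
        if start not in best:
            continue
        score, path = best[start]
        for (s, e), alternatives in lattice.items():
            if s != start:
                continue
            for alt in alternatives:
                cand = (score + lm_logprob(alt, lm), path + [alt])
                if e not in best or cand[0] > best[e][0]:
                    best[e] = cand
    return best.get(n)   # None if no full cover exists

lm = {"i": 0.1, "was": 0.08, "seen": 0.02, "see": 0.03}
lattice = {(0, 1): [("i",)], (1, 3): [("was", "seen"), ("see",)]}
print(decode(lattice, 3, lm))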

2.2.2.4 Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the manually written rule-based Mapudungun-Spanish MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, we did not neglect the available computational resources for Mapudungun either. We developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus. Similarly, a version of the elicitation corpus translated into Mapudungun served as a development suite to guide us on which syntactic transfer rules were most vital to write.

The next three sub-sections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages; manually writing transfer rules forced us to consciously overcome them. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. The third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor of the kind of transfer rules that are needed to write a transfer grammar from Mapudungun to Spanish, two completely unrelated languages. For a more in-depth discussion of the transfer grammar and of translation mismatches between Mapudungun and Spanish, please see

2.2.2.4.1 Suffix Agglutination

As discussed above, Mapudungun is an agglutinative language, with often four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled VSuffG,1 in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second

VSuffG rule, labeled VSuffG,2 in Figure 6, combines an additional VSuff recursively with a VSuffG, passing up the feature structures of both children to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language, Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to, and generation of, Spanish expressions.

These two VSuffG rules successfully apply to combine the two verbal suffixes -fi and -ñ of example (1) into a single VSuffG. First, the rule VSuffG,1 is applied to -fi to produce a VSuffG marked with third person object agreement. Next, the rule VSuffG,2 recursively combines -ñ with the VSuffG containing -fi, producing a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verbal Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1) pe -fi -ñ
    see -3.obj -1.sg.subj.indic
    'I saw him/her/it/them'

{VSuffG,1}
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

{VSuffG,2}
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
 (X0 = X2))

Figure 6: Two recursive Verbal Suffix Group rules
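A minimal Python sketch of the feature percolation these two rules perform, assuming suffix feature structures are plain dictionaries; this is an illustrative simplification of the unification-based formalism, not the actual implementation.

def vsuffg_1(vsuff):
    """{VSuffG,1}: promote a single suffix, copying all features up (X0 = X1)."""
    return dict(vsuff)

def vsuffg_2(vsuffg, vsuff):
    """{VSuffG,2}: merge one more suffix into an existing suffix group
    (X0 = X1, then X0 = X2)."""
    merged = dict(vsuffg)
    merged.update(vsuff)
    return merged

fi = {"object_person": 3}                                          # -fi: 3rd person object
n_suffix = {"subject_person": 1, "number": "sg", "mood": "indic"}  # -ñ: 1.sg subject, indicative
print(vsuffg_2(vsuffg_1(fi), n_suffix))
# {'object_person': 3, 'subject_person': 1, 'number': 'sg', 'mood': 'indic'}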


2.2.2.4.2 Tense

Tense in Mapudungun is often morphologically unmarked. The temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun. Consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined by taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (the transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.

2.2.2.4.3 Typological Divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are grammaticalized using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation. But the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see example (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle; see example (3).

(2) Mapudungun: kofke-tu -n
        bread-verbalizer -1.sg.subj.indic
    Spanish: com-o pan
        eat -1.sg.pres.indic bread
    English: 'I eat bread'

(3) Mapudungun: pe -nge -n
        see -passive -1.sg.subj.indic
    Spanish: fui visto
        Aux.1.sg.past.indic see.passPart
    English: 'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side. The rule for the passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary

Tense     | Aspect   | Lexical Aspect | Temporal Interpretation | Example
Future    | Any      | Any            | Future                  | pe -a -n (see -future -1.sg.subj.indic) 'I will see'
Unmarked  | Habitual | Unmarked       | Present                 | kellu -ke -n (help -habitual -1.sg.subj.indic) 'I help'
Unmarked  | Any      | Stative        | Present                 | niye -n (own -1.sg.subj.indic) 'I own'
Unmarked  | Unmarked | Unmarked       | Past                    | kellu -n (help -1.sg.subj.indic) 'I helped'

Figure 7: Abstract rules for the interpretation of tense in Mapudungun

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
((X1::Y1)                            ; alignment
 ((X2 tense) = UNDEFINED)            ; SL constraint on morphological tense
 ((X1 lexicalaspect) = UNDEFINED)    ; SL constraint on the verb's aspectual class
 ((X2 aspect) = (NOT habitual))      ; SL constraint on grammatical aspect
 ((X0 tense) = past)                 ; SL tense feature assignment
 ...)

Figure 8: Past tense rule (transfer of features omitted)
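The decision logic of Figure 7 can be restated compactly. The Python sketch below assumes the tense, grammatical aspect, and lexical aspect features have already been collected from the V and VSuffG constituents; it is illustrative only, since the system itself expresses this logic as VBar transfer rules.

def temporal_interpretation(tense, aspect, lexical_aspect):
    """Compositional temporal interpretation of a Mapudungun verb,
    following the abstract rules of Figure 7."""
    if tense == "future":             # marked future wins outright
        return "future"
    if aspect == "habitual":          # habitual grammatical aspect -> present
        return "present"
    if lexical_aspect == "stative":   # stative verbs with unmarked tense -> present
        return "present"
    return "past"                     # unmarked, non-habitual, non-stative -> past

assert temporal_interpretation("future", None, None) == "future"    # pe-a-n 'I will see'
assert temporal_interpretation(None, "habitual", None) == "present" # kellu-ke-n 'I help'
assert temporal_interpretation(None, None, "stative") == "present"  # niye-n 'I own'
assert temporal_interpretation(None, None, None) == "past"          # kellu-n 'I helped'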


the correct grammatical features, and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

2.2.2.5 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona under a research license. Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that the inflected form marks (see http://www.lsi.upc.es/~nlp/freeling/parole-es.html). We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute-value pairs of the format that our MT system expects. This way, as the run-time transfer engine needs a particular inflected form of a Spanish word, the transfer engine can pass the uninflected form to the Spanish Morphology Generator and have it generate the appropriate surface inflected form.
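For concreteness, the following Python sketch shows the lookup this mapping enables, assuming the inflected dictionary has been compiled into a table keyed by citation form plus feature-value pairs. The entries shown are invented examples, not taken from the UPC dictionary.

# Table mapping (citation form, sorted feature-value pairs) -> surface form.
# Illustrative entries only.
SPANISH_FORMS = {
    ("cantar", ("mood=ind", "number=sg", "person=1", "tense=past")): "canté",
    ("casa",   ("number=pl",)): "casas",
}

def generate(citation_form, features):
    """Return the surface form for a Spanish stem and a feature dict;
    fall back to the citation form if no inflected form is listed."""
    key = tuple(sorted(f"{attr}={val}" for attr, val in features.items()))
    return SPANISH_FORMS.get((citation_form, key), citation_form)

print(generate("casa", {"number": "pl"}))                               # casas
print(generate("cantar", {"person": 1, "number": "sg",
                          "tense": "past", "mood": "ind"}))             # canté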

When the Spanish morphological generation is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen ('I'm fine')
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN) ) ) ) ) )>

sl: ayudangelay ('he/she was not helped')
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA) ) ) ) ) ) )>
tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO) ) ) ) ) ) )>

sl: Kuan ñi ruka ('John's house')
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA) ) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN) ) ) ) ) )>

2.2.3 Evaluation

2.3 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed. Data collection for Quechua started in 2004, after the AVENUE team established a collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals of building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing

{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]        ; insertion of aux on the Spanish side
((X1::Y2)                               ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
 ((X2 voice) =c passive)                ; Mapudungun voice constraint
 ((Y1 person) = (Y0 person))            ; passing person features to aux in Spanish
 ((Y1 number) = (Y0 number))            ; passing number features to aux in Spanish
 ((Y1 mood) = (Y0 mood))                ; passing mood features to aux in Spanish
 ((Y2 number) =c (Y1 number))           ; Spanish-side agreement constraint
 ((Y1 tense) = past)                    ; assigning tense feature to aux in Spanish
 ((Y1 form) =c ser)                     ; Spanish auxiliary selection
 ((Y2 mood) = part)                     ; Spanish-side verb form constraint
 ...)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce the passive voice, the non-Indo-European language Mapudungun marks passive voice with a verbal suffix. Hence, an auxiliary verb must be inserted on the Spanish side; the auxiliary verb is the first V constituent on the production side of the transfer rule. The rule then ensures that the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.


information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker and were later manually revised. Taking the Elicitation Corpus (Section 2.3.1) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough for the development of MT. The EC resembles a fieldwork questionnaire, containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features such as number, person, tense, and gender; the version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books with parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section 2.2.2.5).

2.3.2.1 Morphology and Translation Lexicons

In order to build translation and morphology lexicons, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books; only 3,002 word types appeared in more than one book.1 Since 16,722 word types were seen only once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

1 This was done before the OCR correction was completed, and thus this list contained OCR errors.

The 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10:

1. Word Segmentation
2. Root Translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the Final Root

Figure 10: Fields required to automatically build the translation and morphology lexicons

The native speaker was asked to provide the Translation of the Final Root (the last field in Figure 10) only if there had been a part-of-speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua this is important for words such as "machuyani" ('I age / get older'), where the root "machu" is an adjective meaning 'old' and the word is a verb whose root really means 'to get old' ("machuyay").2 Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently in Spanish. For these cases, the native speaker was asked to use '||' instead of '|', and the post-processing scripts were designed to check for the consistency of '||' in both the translation and the POS fields; a sketch of this entry-generation step appears below.
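The following Python sketch illustrates the entry-generation step, assuming the annotated fields of Figure 11. The function and field handling are illustrative simplifications of the actual post-processing scripts.

def lexical_entries(root, pos_field, translation_field):
    """Expand an annotated row into (POS, root, translation) entries.
    '||' separates POS-specific groups; '|' separates alternatives
    within a group, as in Figure 11."""
    entries = []
    pos_groups = [g.strip() for g in pos_field.split("||")]
    tr_groups = [g.strip() for g in translation_field.split("||")]
    for pos_group, tr_group in zip(pos_groups, tr_groups):
        for pos in (p.strip() for p in pos_group.split("|")):
            for tr in (t.strip() for t in tr_group.split("|")):
                entries.append((pos, root, tr))
    return entries

# 'chayqa' from Figure 11: Pron and Adj share the translations ese | esa | eso,
# yielding the six entries Pron-ese ... Adj-eso.
for entry in lexical_entries("chay", "Pron | Adj", "ese | esa | eso"):
    print(entry)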

2 -ya- is a verbalizer in Quechua


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word: chayqa
Segmentation: chay+qa
Root Translation: ese | esa | eso
Root POS: Pron | Adj
Word Translation: ese || es ese
Word POS: Pron || VP

Figure 11: Example of a segmented and translated word

V | [ni] -> [decir]
((X1::Y1))

N | [pacha] -> [tiempo]
((X1::Y1))

N | [pacha] -> [tierra]
((X1::Y1))

Pron | [noqa] -> [yo]
((X1::Y1))

Interj | [alli] -> [a pesar]
((X1::Y1))

Adj | [hatun] -> [grande]
((X1::Y1))

Adj | [hatun] -> [alto]
((X1::Y1))

Adv | [kunan] -> [ahora]
((X1::Y1))

Adv | [allin] -> [bien]
((X1::Y1))

Adv | [ama] -> [no]
((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list

Suff::Suff | [s] -> []                  ; 'dicen que' on the Spanish side
((X1::Y1) ((x0 type) = reportative))

Suff::Suff | [si] -> []                 ; when following a consonant
((X1::Y1) ((x0 type) = reportative))

Suff::Suff | [qa] -> []
((X1::Y1) ((x0 type) = emph))

Suff::Suff | [chu] -> []
((X1::Y1) ((x0 type) = interr))

VSuff::VSuff | [nki] -> []
((X1::Y1) ((x0 person) = 2) ((x0 number) = sg) ((x0 mood) = ind) ((x0 tense) = pres) ((x0 inflected) = +))

NSuff::NSuff | [kuna] -> []
((X1::Y1) ((x0 number) = pl))

NSuff::Prep | [manta] -> [de]
((X1::Y1) ((x0 form) = manta))

Figure 13: Manually written suffix lexical entries

For the current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuamán's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in Section 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in Section 2.2.2.5.

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and, when necessary, corrected nine machine translations. The correction of MT output is the first step towards automatic rule refinement and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.


{S,2}
S::S : [NP VP] -> [NP VP]
((X1::Y1)
 (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender)))

{SBar,1}
SBar::SBar : [S] -> [Dice que S]
((X1::Y2)
 ((x1 type) =c reportative))

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
((X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood)))

Figure 14: Manually written grammar rules for Quechua-Spanish translation

sl: taki sha ra ni ('I was singing')
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO) ) ) ) )>

sl: taki ra n si ('it is said that she sang')
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ) ) ) ) ) ) )>

sl: noqa qa barcelona manta ka ni ('I am from Barcelona')
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO) ) ) ) (VP3 (VBAR2 (V35 SOY) ) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA) ) ) ) ) ) )>

Figure 15: MT output from the Quechua-Spanish rule-based MT prototype

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in its early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual; hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,3 not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"), prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"), subordinating conjunctions such as $ ("that") and K$ ("when"), relativizers such as $ ("that"), and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus, a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, of the root, or of a prefixing particle.

3 To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics known as niqqud ("dots") decorate the words, and another in which the dots are omitted but other characters represent some, though not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus, the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges of orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule-based, relying on the AVENUE run-time transfer engine described in Section 2.2.2.3.2 to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in the rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. This small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems we used existing publicly available resources, which we adapted in novel ways for the MT task to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging; to the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source-language morphological analysis and target-language morphological generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation seg-


ments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see the next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules; the same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the Hebrew analyzer is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure, which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic H ("her") as a suffix. Such a form would yield four different sequences of lexeme tokens, which are all stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice representation of a set of analyses for the Hebrew word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-structure representation of a set of analyses for the Hebrew word B$WRH
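The following Python sketch shows how such a set of analyses can be merged into a lattice of spanned lexeme arcs, so that arcs shared by several analyses (such as the preposition B) are stored only once. It assumes simplified token-level spans; the actual representation records character spans, as in Figure 3, and the names are illustrative.

analyses = [
    [("B$WRH", {"POS": "N", "GEN": "F", "NUM": "S"})],            # 'gospel'
    [("B", {"POS": "PREP"}), ("$WRH", {"POS": "N", "GEN": "F"})], # 'in (a) line'
    [("B", {"POS": "PREP"}), ("H", {"POS": "DET"}),
     ("$WRH", {"POS": "N", "GEN": "F"})],                         # 'in the line'
    [("B", {"POS": "PREP"}), ("$WR", {"POS": "N", "GEN": "M"}),
     ("H", {"POS": "POSS"})],                                     # 'in her bull'
]

def to_lattice(analyses):
    """Merge analyses into arcs keyed by (start, end, lexeme); arcs shared
    by several analyses (e.g. the initial B) appear only once."""
    arcs = {}
    for sequence in analyses:
        position = 0
        for lexeme, features in sequence:
            arcs.setdefault((position, position + 1, lexeme), features)
            position += 1
    return arcs

for arc in sorted(to_lattice(analyses)):
    print(arc)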

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice, rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person femi-


nine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation; these handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary would improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these forms should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms with English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify. A sketch of this enhancement step appears below.
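The following Python sketch illustrates the enhancement for nouns, assuming simple string rules for regular English plurals; irregular nouns and the verb paradigms are omitted, and all names are illustrative.

def enhance_noun(hebrew, english):
    """Spawn singular and plural lexical entries for a Hebrew-English noun
    pair. Each entry carries a number constraint, so that unification with
    the analyzed Hebrew features selects exactly one of the forms."""
    plural = english + ("es" if english.endswith(("s", "sh", "ch")) else "s")
    return [
        {"sl": hebrew, "tl": english, "constraints": [("num", "sg")]},
        {"sl": hebrew, "tl": plural,  "constraints": [("num", "pl")]},
    ]

# $MLH ('dress'), as in the Figure 4 examples.
for entry in enhance_noun("$MLH", "dress"):
    print(entry)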

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
SL: $MLH ADWMH
TL: A RED DRESS
Score: 2

NP1::NP1 : [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
SL: H $MLWT H ADWMWT
TL: THE RED DRESSES
Score: 4

NP1::NP1 : [NP1 H ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Figure 4: NP transfer rules for nouns modified by adjectives from Hebrew to English

The transfer engine was designed to support both manually developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results obtained when using this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar that reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5: Select translated sentences from the development data


System          | BLEU                    | NIST                    | Precision | Recall
No Grammar      | 0.0606 [0.0599, 0.0612] | 3.4176 [3.4080, 3.4272] | 0.3830    | 0.4153
Learned Grammar | 0.0775 [0.0768, 0.0782] | 3.5397 [3.5296, 3.5498] | 0.3938    | 0.4219
Manual Grammar  | 0.1013 [0.1004, 0.1021] | 3.7850 [3.7733, 3.7966] | 0.4085    | 0.4241

Table 1: System performance results with the various grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted at the translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also include aggregate unigram precision and unigram recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores of the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
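The following Python sketch illustrates the bootstrap procedure for attaching a confidence interval to a corpus-level score, with a toy unigram-precision metric standing in for BLEU or NIST; the data and names are illustrative, and Efron and Tibshirani (1986) describe the underlying technique.

import random

def bootstrap_interval(pairs, metric, resamples=1000, alpha=0.05, seed=0):
    """Resample the test set with replacement, score each resample, and
    return the (alpha/2, 1 - alpha/2) percentile interval of the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(resamples):
        sample = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        scores.append(metric(sample))
    scores.sort()
    lo = scores[int((alpha / 2) * resamples)]
    hi = scores[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi

def avg_precision(sample):
    """Toy metric: average unigram precision of each hypothesis against its reference."""
    def precision(hyp, ref):
        return sum(word in ref for word in hyp) / max(len(hyp), 1)
    return sum(precision(h, r) for h, r in sample) / len(sample)

pairs = [(("the", "red", "dress"), ("the", "red", "dress")),
         (("a", "dresses"), ("the", "dresses"))]
print(bootstrap_interval(pairs, avg_precision, resamples=200))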


2.11 Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language for which to build MT systems in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than that with the current automatically inferred rules. Furthermore, a multi-engine version of our system that combines the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

2.11.1 Introduction

Corpus-based machine translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,


and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE (cf. [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998] for other approaches to MT for low-density languages). We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1. Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish), for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL):

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

PASSIVE SIMPLE PRESENT
VP::VP : [V V Aux] -> [Aux being V]
(
 (X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part)
)

Fig. 3. A transfer rule for present tense verb sequences in the passive voice
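To make the rule anatomy concrete, the following minimal sketch (in Python, purely illustrative; it is not the actual Xfer implementation, and all names are ours) shows how a rule with these components could be represented, and how its x-side constraints might be checked against an input feature structure:

from dataclasses import dataclass, field

@dataclass
class TransferRule:
    rule_type: str        # e.g. "S" or "NP"; SL and TL types may differ
    x_side: list          # SL component sequence, e.g. ["V", "V", "Aux"]
    y_side: list          # TL component sequence, e.g. ["Aux", "being", "V"]
    alignments: list      # e.g. [(1, 3)] for the (X1::Y3) alignment
    x_constraints: list = field(default_factory=list)  # [(("x1", "form"), "part"), ...]
    y_constraints: list = field(default_factory=list)
    xy_constraints: list = field(default_factory=list)

def x_side_applies(rule, sl_features):
    # sl_features maps component names to feature dicts, e.g.
    # {"x1": {"form": "part", "aspect": "perf"}, "x2": {...}, ...}
    for (component, feature), value in rule.x_constraints:
        if sl_features.get(component, {}).get(feature) != value:
            return False
    return True

Under this toy encoding, the rule of Figure 3 above would apply only to inputs whose first verb component carries (form = part) and (aspect = perf), mirroring the run-time filtering just described.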

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat" in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be rewritten more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al 2003].
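As a rough sketch of what Seed Generation consumes and produces, assume each elicited example arrives as a pair of POS-tagged, word-aligned sentences; the function and field names below are illustrative, not the system's actual interfaces:

def seed_rule(sl_pos, tl_pos, word_alignments, constituent_type="S"):
    # One flat seed rule from a single word-aligned elicited example.
    # sl_pos / tl_pos: POS sequences for the two sides, e.g. ["Det", "N", "P"];
    # word_alignments: 1-based (sl_index, tl_index) pairs from the elicitation tool.
    return {
        "type": constituent_type,   # constituent type of the elicited phrase
        "x_side": list(sl_pos),     # flat: parts of speech only, no phrasal nodes
        "y_side": list(tl_pos),
        "alignments": list(word_alignments),
        "constraints": [],          # unification constraints are added later,
    }                               # from the features of the source examples

Compositionality Learning would then replace sub-sequences of the x- and y-sides (e.g., "Det Adj N") with a non-terminal (e.g., NP) wherever a previously learned rule covers them, and Seeded Version Space Learning would relax the constraints extracted from the examples.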

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability $p(e \mid f)$, where $f = f_1 \ldots f_J$ is the source sentence and $e = e_1 \ldots e_I$ is the sequence of target language words. According to Bayes' decision rule, we have to search for

\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} p(f \mid e)\, p(e)   (1)

The language model $p(e)$ describes how well-formed a target language sentence $e$ is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})   (2)

The translation probability $p(f \mid e)$ for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i)   (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence $s$ of non-overlapping translation units which fully covers the source sentence, the total translation model probability is then

p(f \mid e) = \prod_{s} p(\tilde{f} \mid \tilde{e}) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i})   (4)

The search algorithm considers all possible sequences $s$ in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low-scoring compared to the best-scoring hypothesis up to that point are pruned from further consideration.
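The scoring model of equations (1)-(4) can be summarized in a short sketch. This is a simplified illustration rather than the decoder's actual implementation; trigram and lex stand in for the trained trigram language model and the IBM-1 statistical lexicon:

import math

def lm_logprob(e_words, trigram):
    # Eq. (2): trigram language model p(e) = prod_i p(e_i | e_{i-2}, e_{i-1});
    # `trigram` is an assumed callable: trigram(w, u, v) -> p(w | u, v).
    padded = ["<s>", "<s>"] + e_words
    return sum(math.log(trigram(padded[i], padded[i - 2], padded[i - 1]))
               for i in range(2, len(padded)))

def unit_logprob(f_phrase, e_phrase, lex):
    # Eq. (3): IBM-1 phrase probability prod_j sum_i p(f_j | e_i);
    # `lex` is an assumed lookup: lex(f, e) -> p(f | e) from the trained lexicon.
    return sum(math.log(sum(lex(fj, ei) for ei in e_phrase))
               for fj in f_phrase)

def sequence_logprob(units, trigram, lex):
    # Eqs. (1) and (4): total log-probability of one sequence of adjoining,
    # non-overlapping translation units (f_phrase, e_phrase) covering the input.
    e_words = [w for _, e_phrase in units for w in e_phrase]
    translation = sum(unit_logprob(f, e, lex) for f, e in units)
    return translation + lm_logprob(e_words, trigram)

The decoder's beam search then amounts to maximizing sequence_logprob over all sequences of non-overlapping translation units in the lattice, with the reordering and pruning heuristics described above.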

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they could not translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours: translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translations.
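As an aside, the complexity ordering described above (sorting extracted phrases by parse-tree depth and batching them into files of about 200) reduces to a few lines. This sketch assumes trees are nested (label, children) tuples and is not the actual extraction script:

def depth(tree):
    # Depth of a parse tree given as a nested (label, children) tuple,
    # where children may be subtrees or leaf strings.
    label, children = tree
    return 1 + max((depth(c) for c in children if isinstance(c, tuple)),
                   default=0)

def batch_phrases(phrase_trees, batch_size=200):
    # Order extracted phrases from simplest to most complex, then split
    # them into batches of about `batch_size` phrases for the translators.
    ordered = sorted(phrase_trees, key=depth)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]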

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher          Xfer
past participle      (tam = yA)       (aspect = perf) (form = part)
present participle   (tam = wA)       (aspect = imperf) (form = part)
infinitive           (tam = nA)       (form = inf)
future               (tam = future)   (tense = fut)
subjunctive          (tam = subj)     (tense = subj)
root                 (tam = 0)        (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left-hand side of the figure shows the raw output of the morphology system. The right-hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
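The Table I correspondence amounts to a small lookup table. The following sketch of the feature translation uses an encoding of our own choosing for illustration; the Roman-WX/UTF-8 conversion performed by the wrapper is omitted:

TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},  # past participle
    "wA":     {"aspect": "imperf", "form": "part"},  # present participle
    "nA":     {"form": "inf"},                       # infinitive
    "future": {"tense": "fut"},                      # future
    "subj":   {"tense": "subj"},                     # subjunctive
    "0":      {"form": "root"},                      # root
}

def morpher_to_xfer(morpher_features):
    # Translate a Morpher feature dict, e.g. {"tam": "yA", "gen": "m"},
    # into Xfer-formalism features; non-TAM features pass through unchanged.
    out = {k: v for k, v in morpher_features.items() if k != "tam"}
    out.update(TAM_TO_XFER.get(morpher_features.get("tam"), {}))
    return out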

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules. The verb sequence rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion: (jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion: (one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules
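To see how the Figure 5 rules cooperate, the following toy function recursively applies the NP inversion and the postposition-to-preposition flip. It operates on nested (label, children) tuples rather than the system's actual chart structures, and is an illustration only:

def flip_np(tree):
    # tree: (label, children); children are subtrees or leaf strings.
    label, children = tree
    children = [flip_np(c) if isinstance(c, tuple) else c for c in children]
    if label == "np" and len(children) == 2 and \
            isinstance(children[0], tuple) and children[0][0] == "pp":
        pp, nbar = children
        np_inner, postp = pp[1]            # Hindi PP = [NP Postp]
        # NP::NP [PP NP1] -> [NP1 PP], with PP::PP [NP Postp] -> [Prep NP]
        return ("np", [nbar, ("pp", [postp, np_inner])])
    return (label, children)

Applied to the Figure 6 Hindi tree, this yields the right-branching English shape; the lexical transfer itself (e.g., "ke" to "of") is handled by lexicon rules, not by the structural flip.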

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
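In sketch form, the frequency-based scoring and pruning just described reduces to counting how many training examples produce each rule and dropping the low-probability tail; the threshold value below is illustrative only:

from collections import Counter

def prune_rules(rule_occurrences, threshold=0.01):
    # rule_occurrences: one hashable rule signature per training example
    # that produced the rule; a rule is kept only if its relative frequency
    # (an estimate of rule probability) reaches the threshold.
    counts = Counter(rule_occurrences)
    total = sum(counts.values())
    return {rule: count / total
            for rule, count in counts.items()
            if count / total >= threshold}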

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word, on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were "best bets" for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule, indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, and gerund (and continuous), as well as a future tense entry. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add "es" instead of a simple "s" for plurals, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert, in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of postposition words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
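Returning to the dictionary enhancement described above, a hedged sketch of the kind of spelling-rule pluralizer involved might look as follows; the rules and the sample irregulars are illustrative rather than the actual rule set:

IRREGULAR_PLURALS = {"child": "children", "foot": "feet"}  # sample entries only

def pluralize(noun):
    # Spelling rules of the kind described above: 'es' after sibilants,
    # y -> ies after a consonant, plain 's' otherwise.
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def plural_lexicon_entry(hindi_root, english_noun):
    # The Hindi side stays unchanged; the English side is inflected, and a
    # number constraint is attached so that unification can filter candidates.
    return {"sl": hindi_root, "tl": pluralize(english_noun),
            "constraints": [(("y1", "num"), "pl")]}

The attached (num = pl) constraint is what allows unification to discard such an entry when the Hindi input is singular.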

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details, see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2. Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
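In outline, the three passes could be sketched as follows, where phrasal_lexicon, morpher, grammar, and word_lexicon are assumed interfaces standing in for the system's actual modules:

def build_lattice(hindi_words, phrasal_lexicon, morpher, grammar, word_lexicon):
    lattice = []  # entries: (start, end, english_translation)

    # Pass 1: match full inflected forms against phrasal lexical entries
    # (any entry with more than one Hindi word); no grammar rules applied.
    for start in range(len(hindi_words)):
        for end in range(start + 2, len(hindi_words) + 1):
            for translation in phrasal_lexicon.get(tuple(hindi_words[start:end]), []):
                lattice.append((start, end, translation))

    # Pass 2: run each word through morphology to get (root, features),
    # then let the grammar rules parse, transfer, and generate.
    analyses = [morpher(word) for word in hindi_words]
    lattice.extend(grammar.parse_transfer_generate(analyses))

    # Pass 3: word-to-word translations of the original inflected forms,
    # without grammar rules, to maximize lattice coverage.
    for i, word in enumerate(hindi_words):
        for translation in word_lexicon.get(word, []):
            lattice.append((i, i + 1, translation))

    return lattice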

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                 BLEU               NIST
EBMT                   0.058              4.22
SMT                    0.102 (+/- 0.016)  4.70 (+/- 0.20)
Xfer no grammar        0.109 (+/- 0.015)  5.29 (+/- 0.19)
Xfer learned grammar   0.112 (+/- 0.016)  5.32 (+/- 0.19)
Xfer manual grammar    0.135 (+/- 0.018)  5.59 (+/- 0.20)
MEMT (Xfer+SMT)        0.136 (+/- 0.018)  5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

Fig. 9. Results by NIST score (NIST score vs. decoder reordering window, 0 to 4, for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT)

Fig. 10. Results by BLEU score (BLEU score vs. decoder reordering window, 0 to 4, for the same five systems)

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).

(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.

(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the BLEU score [Papineni et al 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
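A minimal sketch of this bootstrap procedure, with score_fn standing in for BLEU or NIST computed over a set of (hypothesis, references) pairs:

import random

def bootstrap_ci(test_set, score_fn, iterations=10000, seed=0):
    # Resample the test set with replacement, rescore each resample, and
    # report the median and a 95% confidence interval of the metric.
    rng = random.Random(seed)
    n = len(test_set)  # 258 sentences in this experiment
    scores = sorted(score_fn([test_set[rng.randrange(n)] for _ in range(n)])
                    for _ in range(iterations))
    return (scores[iterations // 2],          # median
            scores[int(iterations * 0.025)],  # lower bound of 95% interval
            scores[int(iterations * 0.975)])  # upper bound of 95% interval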

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over" point: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor; a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely-Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html

The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi

Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm

The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html

Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.

Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.

Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.

Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.

Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).

Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.

Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).

Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.

Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.

Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.

Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).

Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.

Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.

Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).

Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).

Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.

Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

In addition to Hindi, Hebrew, Mapudungun, and Quechua, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631 We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer

5. References

Allen, James. (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown and Lori Levin. (2002). Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Cusihuaman, Antonio. (2001). Gramática Quechua Cuzco-Callao. 2a edición. Centro Bartolomé de las Casas.

Dahan, Hiya. (1997). Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. (2002). Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. (1986). Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Font Llitjós, Ariadna, Jaime Carbonell and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich and Lori Levin. (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Levin, Lori, Alison Alvarez, Jeff Good and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html.

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311-318, Philadelphia, PA, July.

Peterson, Erik. (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings Between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie and Jaime Carbonell. (2002). MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 Workshop at MT Summit.

Segal, Erel. (1999). Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. (2004). Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona. (2003). Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


This section provides a basic understanding of syntactic transfer rules and their interpretation within Rule-Based MT in the AVENUE framework. For a more detailed description of the AVENUE Rule-Based MT formalism and the run-time transfer engine, see Peterson (2002) and Probst (2005).

2.2.2.3.1. The AVENUE Transfer Rule Formalism

In Rule-Based MT, transfer rules decompose the grammatical information contained in a source language expression into a set of grammatical properties, such as number, person, tense, lexical meaning, etc. Each rule then builds an equivalent target language expression, copying, modifying, or rearranging grammatical values according to the requirements of the target language syntax and the target language lexical repertoire. In the AVENUE system, translation rules have six components: 1) a rule identifier, which consists of a constituent type (Sentence, Noun Phrase, Verb Phrase, etc.) and a unique ID number; 2) a constituent structure specification for both the source language (SL) and the target language (TL); 3) alignments between the SL constituents and the TL constituents; 4) SL constraints, which provide information about features and their values in the SL sentence; 5) TL constraints, which provide information about features and their values in the TL sentence; and 6) transfer equations, which provide information about which feature values transfer from the source into the target language.

These six components can be illustrated with a portion of a rule intended for a Mapudungun-to-Spanish rule-based transfer system. In Mapudungun, plurality in nouns is marked in some cases by the pronominal particle pu. The transfer rule described here transfers the plural feature from pu in Mapudungun onto Spanish nouns. In this transfer grammar, NBar is the constituent that dominates the noun and its modifiers, but not its determiners. According to this rule, the Mapudungun sequence PART N will be transferred into a noun in Spanish. The constituent alignment aligns the Mapudungun noun (X2) with the Spanish noun (Y1). The SL is checked to ensure the application of the NBar,1 rule only when the Mapudungun particle specifies plural number; if the Mapudungun noun is preceded by any other particle, the rule fails to apply. The number feature is passed up from the particle to the Mapudungun NBar. At the same time, the number that is marked on the Spanish noun is passed up to the Spanish NBar. The gender feature, present only in Spanish, is passed up from the Spanish noun to the Spanish NBar. And finally, a transfer equation ensures that the number on the Spanish NBar equals the number on the Mapudungun NBar. This process can also be represented graphically as a pair of aligned tree structures.
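Putting these pieces together, the rule just described might be written as follows in the transfer-rule notation used throughout this paper. This is a reconstruction from the prose description above, not a verbatim copy of the grammar, and the trailing annotations are ours rather than part of the rule syntax:

{NBar,1}
NBar::NBar : [PART N] -> [N]
(
 (X2::Y1)                        constituent alignment: Mapudungun noun to Spanish noun
 ((X1 number) =c pl)             SL constraint: apply only when the particle (pu) marks plural
 ((X0 number) = (X1 number))     number passed up from the particle to the Mapudungun NBar
 ((Y0 number) = (Y1 number))     number passed up from the Spanish noun to the Spanish NBar
 ((Y0 gender) = (Y1 gender))     gender, present only in Spanish, passed up from the noun
 ((Y0 number) = (X0 number))     transfer equation: Spanish NBar number equals Mapudungun NBar number
)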

We make use of a new framework (Probst et al., 2002) for rapid prototyping of MT systems for languages with limited amounts of electronically available linguistic resources and corpora, which includes a declarative formalism for symbolic transfer grammars. A grammar consists of a collection of transfer rules which specify how phrase structures in a source language correspond and transfer to phrase structures in a target language, and the constraints under which these rules should apply. The framework also includes a fully implemented transfer engine that applies the transfer grammar to a source-language input sentence at runtime and produces all possible word and phrase-level translations according to the grammar. This framework was designed to support research on methods for automatically acquiring transfer grammars from limited amounts of elicited word-aligned data. The framework also supports manual development of transfer grammars by experts familiar with the two languages. While the framework itself is still research work in progress, it is sufficiently well developed for serious experimentation.


2.2.2.3.2. The Runtime Transfer Engine

At run time, the transfer engine interprets individual transfer rules like the one above to convert a morphologically analyzed source language sentence into a target language expression. The output of the run-time system is a lattice of translation alternatives that have yet to be morphologically inflected in the target language. The transfer engine produces a lattice of alternatives, as opposed to a single target language expression, because there may be syntactic and lexical ambiguity in the transfer rules and lexicon. Multiple translation choices for source language lexical items in the dictionary and multiple competing transfer rules can each increase the number of possible translations (see the next section).

The run-time translation system incorporates into a single module all three of the traditional steps of syntactic transfer-based MT: 1) syntactic parsing of the source language input; 2) transfer of the parsed constituents of the source language to their corresponding structured constituents on the target language side; and 3) generation of the target language expressions from the target language constituents. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system.

In the first stage, source language (SL) parsing is performed based solely on the SL side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in Allen (1995). A chart is populated with all the constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A target language (TL) chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules provide individual lexical choices for the TL word-level entries in the TL chart. The set of generated TL output expressions that corresponds to the collection of all TL chart entries is then collected into a TL lattice. Finally, the TL lattice is passed to a statistical decoder, which ranks the possible paths through the lattice of translation alternatives. A more detailed description of the runtime transfer-based translation sub-system can be found in Peterson (2002).

The methodology behind the system combines two separate modules: a transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to an English language model. This general framework has been under development by the MT group at Carnegie Mellon under the AVENUE project (Probst et al., 2002) and was previously used for rapid prototyping of an MT system for Hindi-to-English translation (Lavie et al., 2003).

The transfer engine is the module responsible for applying the comprehensive set of lexical and structural transfer rules, specified by the translation lexicon and the transfer grammar respectively, to the source-language (SL) input lattice, producing a comprehensive collection of target-language (TL) output segments. The output of the transfer engine is a lattice of alternative translation segments. The alternatives arise from syntactic ambiguity, lexical ambiguity, and multiple synonymous choices for lexical items in the translation lexicon.

The transfer engine incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar. In the first stage, parsing is performed based solely on the source-language side of the transfer rules, using what is for the most part a standard bottom-up chart parser. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage: a dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart, with the transfer rules associated with each entry in the SL chart used to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the transfer engine can be found in Peterson (2002).

3.5. Decoding

In the final stage, a decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input, and these translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of the target language string given the source language string. The decoder is designed to use a probability model that calculates this probability as a product of two factors: a translation model for the translation units and a language model for the target language. The translation model is normally based on a probabilistic version of the translation lexicon, in which probabilities are associated with the different possible translations for each source-language word. Such a probability model can be acquired automatically by using a reasonable-size parallel corpus to train a word-to-word probability model based on automatic word alignment algorithms. This is currently not possible for Hebrew-to-English, since we do not have a suitable parallel corpus at our disposal. The decoder for the Hebrew-to-English system therefore uses the English language model as the sole source of information for selecting among alternative


translation segments. We use an English trigram language model trained on 160 million words.

The decoding search algorithm considers all possible sequences in the lattice and calculates the language model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
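Stated in standard noisy-channel notation (the formula below is our summary of the description above, not taken from the system documentation), the decoder approximates

\[
T^{*} \;=\; \operatorname*{argmax}_{t_{1} \ldots t_{k}} \; P_{\mathrm{LM}}(T) \prod_{i=1}^{k} P(t_{i} \mid s_{i}),
\]

where \(t_{1} \ldots t_{k}\) is a sequence of adjoining, non-overlapping translation units covering the input, \(s_{i}\) is the source fragment covered by unit \(t_{i}\), and \(T\) is the target string the sequence yields. In the Hebrew-to-English system the translation-model factor \(\prod_{i} P(t_{i} \mid s_{i})\) is effectively dropped, leaving the English language model as the sole scoring component.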

2.2.2.4. Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the manually written Mapudungun-Spanish rule-based MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, we did not neglect the available computational resources for Mapudungun: we developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus. Similarly, a version of the elicitation corpus translated into Mapudungun served as a development suite to guide us on which syntactic transfer rules were most vital to write.

The next three sub-sections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages; manually writing transfer rules forced us to consciously overcome them. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. And the third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor for the kind of transfer rules that are needed to write a transfer grammar from Mapudungun to Spanish, two completely unrelated languages. For a more in-depth discussion of the transfer grammar and of translation mismatches between Mapudungun and Spanish, please see Aranovich (forthcoming).

2.2.2.4.1. Suffix Agglutination

As discussed above, Mapudungun is an agglutinative language, often with four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled VSuffG,1 in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second VSuffG rule, labeled VSuffG,2 in Figure 6, recursively combines an additional VSuff with a VSuffG, passing up the feature structures of both children to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language, Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to, and generation of, Spanish expressions.

These two VSuffG rules apply successively to combine the two verbal suffixes -fi and -ñ of example (1) into a single VSuffG. First, the rule VSuffG,1 is applied to -fi to produce a VSuffG marked with third person object agreement. Next, the rule VSuffG,2 recursively combines -ñ with the VSuffG containing -fi, yielding a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verbal Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1)  pe   -fi      -ñ
     see  -3.obj   -1.sg.subj.indic
     'I saw him/her/it/them'

{VSuffG,1}
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

{VSuffG,2}
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
 (X0 = X2))

Figure 6: Two recursive Verbal Suffix Group rules
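Schematically, applying these two rules to example (1) accumulates features as follows (the feature names here are illustrative stand-ins, not the exact attribute names used in the grammar):

 -fi         =>  VSuffG ((objperson 3))
 -fi + -ñ    =>  VSuffG ((objperson 3) (person 1) (number sg) (mood indic))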



2.2.2.4.2. Tense

Tense in Mapudungun is often morphologically unmarked; the temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun. Consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (the transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.

2.2.2.4.3. Typological Divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are grammaticalized using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation. But the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see example (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as well as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle; see example (3).

(2)  Mapudungun:  kofke-tu             -n
                  bread-verbalizer     -1.sg.subj.indic
     Spanish:     com-o                pan
                  eat-1.sg.pres.indic  bread
     English:     'I eat bread'

(3)  Mapudungun:  pe   -nge      -n
                  see  -passive  -1.sg.subj.indic
     Spanish:     fui                   visto
                  Aux.1.sg.past.indic   see.passPart
     English:     'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side. The rule for the passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary

Figure 7: Abstract rules for the interpretation of tense in Mapudungun

Grammatical Features        Lexical     Temporal          Example
Tense       Aspect          Aspect      Interpretation
Future      Any             Any         Future            pe -a -n
                                                          see -future -1.sg.subj.indic
                                                          'I will see'
Unmarked    Any             Stative     Present           niye -n
                                                          own -1.sg.subj.indic
                                                          'I own'
Unmarked    Habitual        Unmarked    Present           kellu -ke -n
                                                          help -habitual -1.sg.subj.indic
                                                          'I help'
Unmarked    Unmarked        Unmarked    Past              kellu -n
                                                          help -1.sg.subj.indic
                                                          'I helped'

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
((X1::Y1)                            alignment
 ((X2 tense) = UNDEFINED)            SL constraint on morphological tense
 ((X1 lexicalaspect) = UNDEFINED)    SL constraint on verb's aspectual class
 ((X2 aspect) = (NOT habitual))      SL constraint on grammatical aspect
 ((X0 tense) = past)                 SL tense feature assignment
 ...)

Figure 8: Past tense rule (transfer of features omitted)


the correct grammatical features, and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

2.2.2.5. Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona, under a research license. Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that the inflected form marks (http://www.lsi.upc.es/~nlp/freeling/parole-es.html). We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute and value pairs of the format that our MT system expects. This way, when the run-time transfer engine needs a particular inflected form of a Spanish word, it can pass the uninflected form to the Spanish Morphology Generator and have it generate the appropriate surface inflected form.
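As an illustration of this mapping (the particular form and tag below are our own example of the PAROLE-style encoding used in the UPC dictionary; see the URL above), the inflected form cantó, listed under the citation form cantar, might carry the tag VMIS3S0, which would be converted into feature attribute-value pairs such as:

 cantó  (citation form: cantar, tag VMIS3S0)
   =>  ((pos v) (person 3) (number sg) (tense past) (mood ind))

At run time, a request for cantar with these feature values then yields the surface form cantó.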

When the Spanish morphological generation is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S,5 (VPBAR,2 (VP,1 (VBAR,9 (V,10 ESTOY) (V,15 BIEN) ) ) ) ) )>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S,5 (VPBAR,2 (VP,2 (NEGP,1 (LITERAL NO) (VBAR,3 (V,11 FUE) (V,8 AYUDADA) ) ) ) ) ) )>

tl: NO FUE AYUDADO
tree: <((S,5 (VPBAR,2 (VP,2 (NEGP,1 (LITERAL NO) (VBAR,3 (V,11 FUE) (V,8 AYUDADO) ) ) ) ) ) )>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S,12 (NP,7 (DET,10 LA) (NBAR,2 (N,8 CASA) ) (LITERAL DE) (NP,4 (NBAR,2 (N,1 JUAN) ) ) ) ) )>

2.2.3. Evaluation

[Evaluation results to be completed. This section will include: a hand evaluation; BLEU, NIST, and METEOR scores; an evaluation of the EBMT system on the same test data; statistical significance tests of the results; and an automatic evaluation of the word-to-word translator.]

2.3. Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established a collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals for building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing

{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]     insertion of aux on the Spanish side
((X1::Y2)                            Mapudungun verb (X1) aligned to Spanish verb (Y2)
 ((X2 voice) =c passive)             Mapudungun voice constraint
 ((Y1 person) = (Y0 person))         passing person features to aux in Spanish
 ((Y1 number) = (Y0 number))         passing number features to aux in Spanish
 ((Y1 mood) = (Y0 mood))             passing mood features to aux in Spanish
 ((Y2 number) =c (Y1 number))        Spanish-side agreement constraint
 ((Y1 tense) = past)                 assigning tense feature to aux in Spanish
 ((Y1 form) =c ser)                  Spanish auxiliary selection
 ((Y2 mood) = part)                  Spanish-side verb form constraint
 ...)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, the non-Indo-European language Mapudungun marks passive voice with a verbal suffix. Hence, an auxiliary verb must be inserted on the Spanish side; the auxiliary verb is the first V constituent on the production side of the transfer rule. This rule then ensures that the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.


information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker, and were later manually revised. Taking the Elicitation Corpus (Section 2.3.1) as a reference, we developed a small translation grammar manually.

2.3.1. Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough to use for development of MT. The EC resembles a fieldwork questionnaire containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features, such as number, person, tense, and gender. The version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books which had parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2. A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section 2.2.2.5).
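As a concrete illustration of this pipeline, consider the first example of Figure 15 below; the intermediate representations shown here are our own schematic reconstruction, not an actual system trace:

 input:      takisharani  ('I was singing'; assumed unsegmented surface form)
 analyzer:   taki sha ra ni  ->  [V taki] [VSuff sha] [VSuff ra] [VSuff ni]
 transfer:   VBar and VP rules, plus the lexical rule mapping taki to cantar,
             produce a lattice containing estar + cantar (gerund), past, 1.sg
 generation: ESTUVE CANTANDO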

2.3.2.1. Morphology and Translation Lexicons

In order to build translation and morphology lexicons, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books: only 3,002 word types appeared in more than one book.1 Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

These 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10.

1 This was done before the OCR correction was completed, and thus this list contained OCR errors.

1. Word Segmentation
2. Root Translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the Final Root

Figure 10: Fields required to automatically build the translation and morphology lexicons

The native speaker was asked to provide the Translation of the Final Root (the last field in Figure 10) only if there had been a Part-of-Speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua this is important for words such as "machuyani" (I age / get older), where the root "machu" is an adjective meaning "old", and the word is a verb whose root really means "to get old" ("machuyay").2 Instead of having a lexical entry like V - machuy - viejo (old), we are interested in having a lexical entry V - machu(ya)y - envejecer (to get old).
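In the lexical-entry format of Figure 12, the intended contrast is between entries such as the following (constructed here for illustration, not taken from the actual lexicon):

 Adj::Adj |: [machu] -> [viejo]
 ((X1::Y1))

 V::V |: [machuya] -> [envejecer]
 ((X1::Y1))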

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently into Spanish. For these cases, the native speaker was asked to use "||" instead of "|", and the post-processing scripts were designed to check for the consistency of "||" in both the translation and the POS fields.

2 -ya- is a verbalizer in Quechua.


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word     Segmentation   Root Translation   Root POS     Word Translation   Word POS
chayqa   chay+qa        ese | esa | eso    Pron | Adj   ese || es ese      Pron || VP

Figure 11: Example of a word segmented and translated

V::V |: [ni] -> [decir]
((X1::Y1))

N::N |: [pacha] -> [tiempo]
((X1::Y1))

N::N |: [pacha] -> [tierra]
((X1::Y1))

Pron::Pron |: [noqa] -> [yo]
((X1::Y1))

Interj::Interj |: [alli] -> [a pesar]
((X1::Y1))

Adj::Adj |: [hatun] -> [grande]
((X1::Y1))

Adj::Adj |: [hatun] -> [alto]
((X1::Y1))

Adv::Adv |: [kunan] -> [ahora]
((X1::Y1))

Adv::Adv |: [allin] -> [bien]
((X1::Y1))

Adv::Adv |: [ama] -> [no]
((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list

Suff::Suff |: [s] -> []          "dicen que" on the Spanish side
((X1::Y1)
 ((x0 type) = reportative))

Suff::Suff |: [si] -> []         when following a consonant
((X1::Y1)
 ((x0 type) = reportative))

Suff::Suff |: [qa] -> []
((X1::Y1)
 ((x0 type) = emph))

Suff::Suff |: [chu] -> []
((X1::Y1)
 ((x0 type) = interr))

VSuff::VSuff |: [nki] -> []
((X1::Y1)
 ((x0 person) = 2)
 ((x0 number) = sg)
 ((x0 mood) = ind)
 ((x0 tense) = pres)
 ((x0 inflected) = +))

NSuff::NSuff |: [kuna] -> []
((X1::Y1)
 ((x0 number) = pl))

NSuff::Prep |: [manta] -> [de]
((X1::Y1)
 ((x0 form) = manta))

Figure 13: Manually written suffix lexical entries

For the current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuaman's grammar (2001) lists a total of 150 suffixes.

2.3.2.2. Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in Section 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in Section 2.2.2.5.

2.3.3. Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and, when necessary, corrected nine machine translations. The correction of MT output is the first step towards automatic rule refinement, and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.

Figure 14: Manually written grammar rules for Quechua-Spanish translation

{S,2}
S::S : [NP VP] -> [NP VP]
(
 (X1::Y1)
 (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ;; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ;; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender))
)

{SBar,1}
SBar::SBar : [S] -> [Dice que S]
(
 (X1::Y2)
 ((x1 type) =c reportative)
)

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
(
 (X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood))
)

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S,1 (VP,0 (VBAR,5 (V,00 ESTUVE) (V,21 CANTANDO) ) ) ) )>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR,1 (LITERAL DICE QUE) (S,1 (VP,0 (VBAR,1 (VBAR,4 (V,21 CANTÓ) ) ) ) ) ) )>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S,2 (NP,6 (NP,1 (PRONBAR,1 (PRON,01 YO) ) ) ) (VP,3 (VBAR,2 (V,35 SOY) ) (NP,5 (NSUFF,14 DE) (NP,2 (NBAR,1 (N,23 BARCELONA) ) ) ) ) ) )>


Figure 15: MT output from the Quechua-Spanish Rule-Based MT prototype

2.4. Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants, and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in its early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,3 not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"), prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"), subordinating conjunctions such as $ ("that") and K$ ("when"), relativizers such as $ ("that"), and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus, a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, of the root, or of a prefixing particle.

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus, the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

3 To facilitate readability, we use a transliteration of Hebrew into ASCII characters throughout this paper.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule-based, relying on the AVENUE run-time transfer engine described in Section 2.2.2.3.2 to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. This small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing, publicly available resources, which we adapted in novel ways for the MT task to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance for both rule-based systems using state-of-the-art measures, and the results are encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation


segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5. Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in the Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see the next sub-section) was, however, designed to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules; the same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
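For concreteness, a few of the correspondences in this romanization, inferred from the examples used in this paper (e.g., B$WRH, MHGR); this partial table is our reconstruction rather than an official specification:

 א -> A    ב -> B    ה -> H    ו -> W    מ -> M    ר -> R    ש -> $    ת -> T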

2.6. Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project is not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew was thus readily available, it is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which are all stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice representation of a set of analyses for the Hebrew word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-structure representation of a set of analyses for the Hebrew word B$WRH

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine


personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7. Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries required some manual editing. This primarily involved removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary would improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) which require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At runtime, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
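A minimal sketch of the enhancement and of the runtime unification check, under a simplified representation of our own (base entries here carry an explicit singular constraint; the real lexical transfer rules are richer):

# Minimal sketch: expand each noun entry with a plural English form plus a
# (num = pl) constraint; at runtime, keep only entries whose constraints
# unify with the features of the analyzed Hebrew word.

def enhance_nouns(lexicon, pluralize):
    expanded = []
    for hebrew, english, constraints in lexicon:
        expanded.append((hebrew, english, constraints))
        expanded.append((hebrew, pluralize(english), {**constraints, "num": "pl"}))
    return expanded

def unifies(word_feats, constraints):
    # A constraint fails only if the analyzed word carries a conflicting value.
    return all(word_feats.get(k, v) == v for k, v in constraints.items())

naive_plural = lambda w: w + "es" if w.endswith(("s", "sh", "ch")) else w + "s"
lexicon = enhance_nouns([("$WRH", "line", {"pos": "n", "num": "sg"})], naive_plural)

# A Hebrew noun analyzed as plural selects only the plural English entry:
feats = {"pos": "n", "num": "pl"}
print([e for h, e, c in lexicon if h == "$WRH" and unifies(feats, c)])  # ['lines']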

2.9 Two Hebrew-English Transfer Grammars

NP1,2:
SL: $MLH ADWMH
TL: A RED DRESS
Score: 2

NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1))

NP1,3:
SL: H $MLWT H ADWMWT
TL: THE RED DRESSES
Score: 4

NP1::NP1 [NP1 H ADJ] -> [ADJ NP1]
((X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1))

Figure 4: NP Transfer Rules for Nouns Modified by Adjectives, from Hebrew to English

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 4.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb complexes and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 4, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Figure 5: Select Translated Sentences from the Development Data


System            BLEU                      NIST                      Precision  Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830     0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938     0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085     0.4241

Table 1: System Performance Results with the Various Grammars (confidence intervals in brackets)

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual grammar system outperforms the no-grammar system according to all the metrics. The learned grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.

2.11 Hindi

Little natural language processing work had been done on Hindi by 2003, when DARPA designated Hindi as the language for a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation, and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney 2002] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years, and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or in short the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1: Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups, corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.
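A minimal sketch of this compositional generalization, over a simplified flat-rule representation of our own (the actual learner operates on full feature structures and alignments):

# Minimal sketch: compositionality learning generalizes a flat rule by
# replacing sub-sequences that match the right-hand side of an already
# learned lower-level rule (here, an NP rule) with that rule's label.

def generalize(flat_rhs, learned_rules):
    """flat_rhs: list of POS tags; learned_rules: {label: rhs tuple}."""
    result, i = [], 0
    while i < len(flat_rhs):
        for label, rhs in learned_rules.items():
            if tuple(flat_rhs[i:i + len(rhs)]) == rhs:
                result.append(label)   # matched a lower-level constituent
                i += len(rhs)
                break
        else:
            result.append(flat_rhs[i])
            i += 1
    return result

np_rules = {"NP": ("Det", "Adj", "N")}
print(generalize(["Det", "Adj", "N", "P"], np_rules))  # ['NP', 'P'] -> a PP rule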


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE, SIMPLE PRESENT}
VP::VP [V V Aux] -> [Aux being V]
((X1::Y3)
((x1 form) = part)
((x1 aspect) = perf)
((x2 form) = part)
((x2 aspect) = imperf)
((x2 lexwx) = jAnA)
((x3 lexwx) = honA)
((x3 tense) = pres)
((x0 tense) = (x3 tense))
(x0 = x1)
((y1 lex) = be)
((y1 tense) = pres)
((y3 form) = part))

Fig. 3: A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.
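These components map naturally onto a simple record; a minimal sketch with field names of our own (the engine's internal representation differs):

# Minimal sketch: a transfer rule as a record holding the components above.
# Constraints are kept as (path, op, value) triples for illustration.

from dataclasses import dataclass

@dataclass
class TransferRule:
    rule_type: str        # e.g. "VP::VP" (SL and TL types may differ)
    x_side: list          # SL right-hand side, e.g. ["V", "V", "Aux"]
    y_side: list          # TL right-hand side, e.g. ["Aux", "being", "V"]
    alignments: list      # e.g. [(1, 3)] for (X1::Y3)
    x_constraints: list   # e.g. [(("x1", "form"), "=", "part")]
    y_constraints: list   # e.g. [(("y1", "tense"), "=", "pres")]
    xy_constraints: list  # e.g. [(("x0", "tense"), "=", ("x3", "tense"))]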

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent"; literally: sent going present-tense-auxiliary) in a sentence such as Ab tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail"; literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat", in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as Det N P and Det Adj N P can be rewritten more generally as NP P, as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only

for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up chart parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to the Bayes decision rule, we have to search for

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e) \qquad (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1}) \qquad (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i) \qquad (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully covers the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_{s} p(\tilde{f}_s \mid \tilde{e}_s) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i}) \qquad (4)

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low-scoring compared to the best-scoring hypothesis up to that point are pruned from further consideration.
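A minimal sketch of the search in equations (1)-(4), as dynamic programming over source positions; the probability tables and the per-unit application of the language model are simplifications of ours (no reordering or beam pruning):

# Minimal sketch: pick the sequence of adjoining, non-overlapping
# translation units covering the input that maximizes trigram LM
# probability times IBM-1 translation probability, in log space.

import math

def lm_logprob(words, trigram):
    """Trigram LM log-probability with a tiny floor for unseen trigrams."""
    ctx, total = ("<s>", "<s>"), 0.0
    for w in words:
        total += math.log(trigram.get((ctx[0], ctx[1], w), 1e-6))
        ctx = (ctx[1], w)
    return total

def unit_logprob(f_words, e_words, lex):
    """IBM-1 style phrase score: prod_j sum_i p(f_j | e_i)."""
    return sum(math.log(sum(lex.get((f, e), 1e-9) for e in e_words))
               for f in f_words)

def decode(units, n, trigram, lex):
    """units: (start, end, f_words, e_words) tuples over the n-word input.
    For brevity, the LM is applied per unit, ignoring cross-unit context."""
    best = {0: (0.0, [])}  # source position -> (log score, English words)
    for pos in range(n):
        if pos not in best:
            continue
        score, words = best[pos]
        for start, end, f_w, e_w in units:
            if start != pos:
                continue
            cand = score + unit_logprob(f_w, e_w, lex) + lm_logprob(e_w, trigram)
            if end not in best or cand > best[end][0]:
                best[end] = (cand, words + list(e_w))
    return best.get(n, (None, None))[1]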

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way, we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher          Xfer
past participle      (tam = yA)       (aspect = perf) (form = part)
present participle   (tam = wA)       (aspect = imperf) (form = part)
infinitive           (tam = nA)       (form = inf)
future               (tam = future)   (tense = fut)
subjunctive          (tam = subj)     (tense = subj)
root                 (tam = 0)        (form = root)

Table I: Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left-hand side of the figure shows the raw output of the morphology system. The right-hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
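A minimal sketch of the Table I mapping from Morpher's tam values into the Xfer feature system (the server wrapper and encoding conversion are not shown):

# Minimal sketch: map Morpher's tense/aspect/mood (tam) feature values
# to the Xfer grammar's feature system, following Table I.

TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},  # past participle
    "wA":     {"aspect": "imperf", "form": "part"},  # present participle
    "nA":     {"form": "inf"},                       # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def to_xfer_features(morpher_feats):
    """Translate a Morpher analysis dict, e.g. {'tam': 'yA', 'gen': 'm'}."""
    xfer = {k: v for k, v in morpher_feats.items() if k != "tam"}
    xfer.update(TAM_TO_XFER.get(morpher_feats.get("tam", "0"), {}))
    return xfer

print(to_xfer_features({"tam": "yA", "gen": "m", "num": "s"}))
# {'gen': 'm', 'num': 's', 'aspect': 'perf', 'form': 'part'}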

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

NP1,2:
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2)
(X2::Y1))

NP1,3:
NP::NP [NP1] -> [NP1]
((X1::Y1))

PP1,2:
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2)
(X2::Y1))

Fig. 5: Recursive NP Rules

Hindi NP with left recursion (jIvana (life) ke (of) eka (one) aXyAya (chapter)):
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion (one chapter of life):
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6: Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)
(X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7: Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
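A minimal sketch of this frequency-based pruning; the count table is hypothetical, and the actual probability model is described in [Probst et al. 2003]:

# Minimal sketch: assign each learned rule a probability proportional to
# the number of training examples that produced it, then drop rules whose
# probability falls below a threshold.

def prune_rules(rule_counts, threshold=0.01):
    """rule_counts: {rule_id: number of supporting training examples}."""
    total = sum(rule_counts.values())
    probs = {r: c / total for r, c in rule_counts.items()}
    return {r: p for r, p in probs.items() if p >= threshold}

counts = {"NP->[ADJ N]": 412, "PP->[NP POSTP]": 388, "PP->[N CONJ ...]": 2}
print(prune_rules(counts))  # the rule supported by only 2 examples is pruned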

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense entries. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. These rules indicate, for instance, when to reduplicate a consonant, when to add "es" instead of a simple "s" for plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of postposition words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice where each entry indicates which Hindi words the partial translation spans, and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the words of the Hindi sentence against the lexicon in their full, inflected form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module, to obtain the root forms along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations, without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
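A minimal sketch of the three passes; the lexicon layout and the analyze/apply_grammar helpers are stand-ins of ours:

# Minimal sketch: three passes that populate the translation lattice.
# Pass 1: full-form phrasal lexicon matches, no grammar rules.
# Pass 2: morphological root forms fed into the grammar rules.
# Pass 3: full-form word-to-word dictionary matches, no grammar rules.

def build_lattice(words, lexicon, apply_grammar, analyze):
    """lexicon: {hindi string: [english, ...]}; analyze: word -> (root, feats);
    apply_grammar: list of analyses -> lattice entries (both are stand-ins)."""
    lattice = []  # entries: (start, end, english, source_of_entry)

    # Pass 1: phrasal entries matched on full inflected surface forms.
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):  # 2+ words = phrasal entry
            for en in lexicon.get(" ".join(words[i:j]), []):
                lattice.append((i, j, en, "phrasal"))

    # Pass 2: morphological analysis feeds root forms into the grammar
    # rules, exploiting the inflection-aware lexicon enhancements.
    lattice.extend(apply_grammar([analyze(w) for w in words]))

    # Pass 3: word-to-word fallback on the original inflected forms.
    for i, w in enumerate(words):
        for en in lexicon.get(w, []):
            lattice.append((i, i + 1, en, "word"))

    return lattice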

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8: Example of a Test Sentence and Reference Translations

System                  BLEU                NIST
EBMT                    0.058               4.22
SMT                     0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer + SMT)       0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II: System Performance Results for the Various Translation Approaches

Fig. 9: Results by NIST score, as a function of the decoder reordering window (0-4), for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT

Fig. 10: Results by BLEU score, as a function of the decoder reordering window (0-4), for the same systems

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string, according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
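A minimal sketch of this bootstrap procedure; the score callable stands in for the actual BLEU/NIST scorers:

# Minimal sketch: bootstrap resampling over the 258-sentence test set to
# obtain a median score and a 95% confidence interval for one system.

import random

def bootstrap(sentences, score, n_draws=10000, seed=0):
    """sentences: per-sentence data; score: callable on a list of them."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_draws):
        resampled = [rng.choice(sentences) for _ in sentences]  # with replacement
        samples.append(score(resampled))
    samples.sort()
    return (samples[len(samples) // 2],           # median
            samples[int(0.025 * len(samples))],   # lower 95% bound
            samples[int(0.975 * len(samples))])   # upper 95% bound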

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language resource. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02).
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for resource-scarce languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more machine translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the languages contrasted in this paper (Hindi, Hebrew, Mapudungun, and Quechua), the AVENUE team has also worked on Dutch, and is beginning work on Inupiaq, Aymara, Brazilian Portuguese, and languages native to Brazil.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. The people who participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631


The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631 We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer

5. References

Allen, James. (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.

Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramatica Quechua: Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich and Lori Levin. (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown and Lori Levin. (2002). Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. (1997). Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. (2002). Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. (1986). Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie and Jaime Carbonell. (2002). MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. (1999). Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. (2004). Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona. (2003). Resources for Processing Hebrew. In Proceedings of the MT-Summit IX workshop on Machine Translation for Semitic Languages, New Orleans, September.


2.2.2.3.2 The Runtime Transfer Engine

At run time, the transfer engine interprets individual transfer rules, such as the ones shown in the figures below, to convert a morphologically analyzed source-language sentence into a target-language expression. The output of the run-time system is a lattice of translation alternatives that have yet to be morphologically inflected in the target language. The transfer engine produces a lattice of alternatives, as opposed to a single target-language expression, because there may be syntactic and lexical ambiguity in the transfer rules and lexicon. Multiple translation choices for source-language lexical items in the dictionary and multiple competing transfer rules can each increase the number of possible translations (see next section).

The run-time translation system incorporates into a single module all three of the traditional steps of syntactic transfer-based MT: 1) syntactic parsing of the source-language input; 2) transfer of the parsed constituents of the source language to their corresponding structured constituents on the target-language side; and 3) generation of the target-language expressions from the target-language constituents. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system.

In the first stage, source-language (SL) parsing is performed based solely on the SL side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in Allen (1995). A chart is populated with all the constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A target-language (TL) chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart: the transfer rules associated with each entry in the SL chart determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules provide individual lexical choices for the TL word-level entries in the TL chart. The set of generated TL output expressions that corresponds to the collection of all TL chart entries is then collected into a TL lattice. Finally, the TL lattice is passed to a statistical decoder, which ranks the possible paths through the lattice of translation alternatives. A more detailed description of the runtime transfer-based translation sub-system can be found in Peterson (2002).
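To make the first, parsing stage concrete, the following minimal Python sketch (our own illustration; the rule and chart representations are invented and far simpler than those of the actual engine described in Peterson (2002)) populates a chart bottom-up from the context-free skeletons of the source sides of a few transfer rules:

from collections import defaultdict

# Toy grammar: source-side context-free skeletons of transfer rules.
# Each rule: (parent label, sequence of child labels).
RULES = [
    ("VSuffG", ("VSuff",)),
    ("VSuffG", ("VSuffG", "VSuff")),
    ("VBar",   ("V", "VSuffG")),
]

def parse(tokens_pos):
    """tokens_pos: POS labels for the morphemes of one input word.
    Returns a chart mapping (start, end) -> set of constituent labels."""
    n = len(tokens_pos)
    chart = defaultdict(set)
    for i, pos in enumerate(tokens_pos):          # seed with lexical entries
        chart[(i, i + 1)].add(pos)
    changed = True
    while changed:                                # apply rules bottom-up to a fixpoint
        changed = False
        for parent, children in RULES:
            for span in _matches(chart, children, n):
                if parent not in chart[span]:
                    chart[span].add(parent)
                    changed = True
    return chart

def _matches(chart, children, n):
    """Yield (start, end) spans covered by the given child-label sequence."""
    def extend(start, remaining):
        if not remaining:
            yield start
            return
        for (s, e), labels in list(chart.items()):
            if s == start and remaining[0] in labels:
                yield from extend(e, remaining[1:])
    for start in range(n):
        for end in extend(start, list(children)):
            yield (start, end)

# pe -fi -ñ analyzed as V VSuff VSuff: the chart acquires a VSuffG over
# the two suffixes and a VBar spanning the whole word.
print(parse(["V", "VSuff", "VSuff"]))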

The methodology behind the system combines two separate modules: a transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to an English language model. This general framework has been under development by the MT group at Carnegie Mellon under the AVENUE project (Probst et al., 2002), and was previously used for rapid prototyping of an MT system for Hindi-to-English translation (Lavie et al., 2003).

The transfer engine is the module responsible for applying the comprehensive set of lexical and structural transfer rules, specified by the translation lexicon and the transfer grammar respectively, to the source-language (SL) input lattice, producing a comprehensive collection of target-language (TL) output segments. The output of the transfer engine is a lattice of alternative translation segments. The alternatives arise from syntactic ambiguity, lexical ambiguity, and multiple synonymous choices for lexical items in the translation lexicon.

The transfer engine incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar. In the first stage, parsing is performed based solely on the source-language side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the transfer engine can be found in Peterson (2002).

3.5 Decoding

In the final stage, a decoder is used in order to select a single target-language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input, and these translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of the target-language string given the source-language string. The decoder is designed to use a probability model that calculates this probability as a product of two factors: a translation model for the translation units and a language model for the target language. The translation model is normally based on a probabilistic version of the translation lexicon, in which probabilities are associated with the different possible translations for each source-language word. Such a probability model can be acquired automatically, using a reasonable-size parallel corpus to train a word-to-word probability model based on automatic word alignment algorithms. This is currently not possible for Hebrew-to-English, since we do not have a suitable parallel corpus at our disposal. The decoder for the Hebrew-to-English system therefore uses the English language model as the sole source of information for selecting among alternative translation segments. We use an English trigram language model trained on 160 million words.



The decoding search algorithm considers all possible sequences in the lattice and calculates the language-model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of reordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
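A minimal sketch of this selection step follows, assuming an invented segment representation and a toy per-word "language model" (the real decoder uses a trigram model, scores across segment boundaries, and supports limited reordering); it is a Viterbi-style dynamic program over input positions:

# Each translation unit covers input positions [start, end) and offers one
# English string with a translation-model score; toy data for illustration.
LATTICE = [
    (0, 1, "the", -0.5), (0, 2, "the house", -1.0),
    (1, 2, "house", -0.4), (2, 3, "is red", -0.7),
]

def lm_logprob(words):
    """Stand-in for a real language model: per-word log-probabilities."""
    toy = {"the": -1.0, "house": -2.0, "is": -1.5, "red": -2.5}
    return sum(toy.get(w, -10.0) for w in words)

def decode(lattice, n):
    """Best sequence of adjoining, non-overlapping units covering 0..n."""
    best = {0: (0.0, [])}                     # input position -> (score, words)
    for pos in range(n):
        if pos not in best:
            continue
        score, words = best[pos]
        for (s, e, text, tm) in lattice:
            if s == pos:                      # unit adjoins the current prefix
                cand = (score + tm + lm_logprob(text.split()),
                        words + text.split())
                if e not in best or cand[0] > best[e][0]:
                    best[e] = cand
    return best.get(n)

print(decode(LATTICE, 3))                     # highest-scoring full translation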

2.2.2.4 Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the Mapudungun-Spanish manually written rule-based MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, we did not neglect the available computational resources for Mapudungun: we developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus. Similarly, a version of the elicitation corpus translated into Mapudungun served as a development suite to guide us on which syntactic transfer rules were most vital to write.

The next three sub-sections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages; manually writing transfer rules forced us to consciously overcome them. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. And the third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor for the kind of transfer rules that are needed to write a transfer grammar from Mapudungun to Spanish, two completely unrelated languages. A more in-depth discussion of the transfer grammar and of translation mismatches between Mapudungun and Spanish is forthcoming.

2.2.2.4.1 Suffix Agglutination

As discussed above, Mapudungun is an agglutinative language, with often four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled VSuffG 1 in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second VSuffG rule, labeled VSuffG 2 in Figure 6, combines an additional VSuff recursively with a VSuffG, passing up the feature structures of both children to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language, Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to, and generation of, Spanish expressions.

These two VSuffG rules successfully apply to combine the two verbal suffixes -fi and -ñ into a single VSuffG. First, the rule VSuffG 1 applies to -fi to produce a VSuffG marked with third-person object agreement. Next, the rule VSuffG 2 recursively combines -ñ with the VSuffG containing -fi into a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verbal Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1) pe  -fi    -ñ
    see -3.obj -1sg.subj.indic
    'I saw him/her/them/it'

{VSuffG,1}
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

{VSuffG,2}
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
 (X0 = X2))

Figure 6: Two recursive Verbal Suffix Group rules.
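The net effect of the two rules is the recursive accumulation of each suffix's features into the growing suffix group, which can be pictured with ordinary dictionaries; a sketch, with feature names invented for illustration:

def vsuffg_1(vsuff):
    """VSuffG <- VSuff: copy the suffix's features into a new group."""
    return dict(vsuff)

def vsuffg_2(vsuffg, vsuff):
    """VSuffG <- VSuffG VSuff: merge the new suffix into the group."""
    merged = dict(vsuffg)
    merged.update(vsuff)
    return merged

fi = {"object": "3"}                              # -fi
ny = {"subject": "1sg", "mood": "indicative"}     # -ñ

group = vsuffg_2(vsuffg_1(fi), ny)
print(group)   # {'object': '3', 'subject': '1sg', 'mood': 'indicative'}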



2.2.2.4.2 Tense

Tense in Mapudungun is often morphologically unmarked: the temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun; consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.

2.2.2.4.3 Typological Divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are grammaticalized using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation. But the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as well as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle; see (3).

(2) Mapudungun: kofke-tu-n
                bread-verbalizer-1sg.subj.indic
    Spanish:    com-o pan
                eat-1sg.pres.indic bread
    English:    'I eat bread'

(3) Mapudungun: pe-nge-n
                see-passive-1sg.subj.indic
    Spanish:    fui visto
                Aux.1sg.past.indic see.passPart
    English:    'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side. The rule for passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary the correct grammatical features, and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

Tense    | Aspect   | Lexical Aspect | Temporal Interpretation | Example
Future   | Any      | Any            | Future                  | pe-a-n (see-future-1sg.subj.indic) 'I will see'
Unmarked | Any      | Stative        | Present                 | niye-n (own-1sg.subj.indic) 'I own'
Unmarked | Habitual | Unmarked       | Present                 | kellu-ke-n (help-habitual-1sg.subj.indic) 'I help'
Unmarked | Unmarked | Unmarked       | Past                    | kellu-n (help-1sg.subj.indic) 'I helped'

Figure 7: Abstract rules for the interpretation of tense in Mapudungun.

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
((X1::Y1)                            ; alignment
 ((X2 tense) = UNDEFINED)            ; SL constraint on morphological tense
 ((X1 lexicalaspect) = UNDEFINED)    ; SL constraint on the verb's aspectual class
 ((X2 aspect) = (NOT habitual))      ; SL constraint on grammatical aspect
 ((X0 tense) = past)                 ; SL tense feature assignment
 ...)

Figure 8: Past tense rule (transfer of features omitted).
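The decision logic of Figure 7 amounts to a small function over the three sources of information; a sketch, with the feature values encoded as plain strings of our own choosing:

def temporal_interpretation(tense, aspect, lexical_aspect):
    """Sketch of the Figure 7 decision table for Mapudungun tense."""
    if tense == "future":
        return "future"
    # tense is morphologically unmarked from here on
    if lexical_aspect == "stative":
        return "present"
    if aspect == "habitual":
        return "present"
    return "past"

assert temporal_interpretation("future", "any", "any") == "future"          # pe-a-n
assert temporal_interpretation("unmarked", "any", "stative") == "present"   # niye-n
assert temporal_interpretation("unmarked", "habitual", "any") == "present"  # kellu-ke-n
assert temporal_interpretation("unmarked", "unmarked", "unmarked") == "past" # kellu-n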


{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]     ; insertion of aux on the Spanish side
((X1::Y2)                            ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
 ((X2 voice) =c passive)             ; Mapudungun voice constraint
 ((Y1 person) = (Y0 person))         ; passing person features to aux in Spanish
 ((Y1 number) = (Y0 number))         ; passing number features to aux in Spanish
 ((Y1 mood) = (Y0 mood))             ; passing mood features to aux in Spanish
 ((Y2 number) =c (Y1 number))        ; Spanish-side agreement constraint
 ((Y1 tense) = past)                 ; assigning tense feature to aux in Spanish
 ((Y1 form) =c ser)                  ; Spanish auxiliary selection
 ((Y2 mood) = part)                  ; Spanish-side verb form constraint
 ...)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, the non-Indo-European language Mapudungun marks passive voice with a verbal suffix. Hence an auxiliary verb must be inserted on the Spanish side; the auxiliary verb is the first V constituent on the production side of the transfer rule. This rule then ensures the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.

2.2.2.5 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona, under a research license (http://www.lsi.upc.es/~nlp/freeling/parole-es.html). Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that inflected form marks. We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute and value pairs of the format that our MT system expects. This way, as the run-time transfer engine needs a particular inflected form of a Spanish word, the transfer engine can pass the uninflected form to the Spanish Morphology Generator and have it generate the appropriate surface inflected forms.
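Conceptually, generation is then a keyed lookup from a stem plus a set of feature-values to a surface form; a minimal sketch with invented entries (not the actual UPC dictionary data or module interface):

# Sketch: generation as lookup from (lemma, normalized feature set) to a
# surface form. The entries below are illustrative only.
GENERATION_TABLE = {
    ("estar", ("mood=ind", "number=sg", "person=1", "tense=past")): "estuve",
    ("cantar", ("mood=ger",)): "cantando",
}

def generate(lemma, **features):
    # normalize features into a canonical, order-independent key
    key = (lemma, tuple(sorted(f"{k}={v}" for k, v in features.items())))
    return GENERATION_TABLE.get(key, lemma)   # fall back to the citation form

print(generate("estar", person=1, number="sg", tense="past", mood="ind"))
# -> "estuve"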

When the Spanish morphological generation is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN))))))>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA)))))))>
tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO)))))))>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA)) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN))))))>

2.2.3 Evaluation

Things that need to go in here:
- Hand evaluation
- BLEU, NIST, and METEOR scores
- Can we evaluate the EBMT system as well on the same test data?
- Statistical significance tests of the results
- Automatic evaluation of a word-to-word translator

2.3 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals for building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing information from non-expert bilingual speakers (Font Llitjós et al., 2005a).



For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker, and were later manually revised. Taking the Elicitation Corpus (section N) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough to use for development of MT. The EC resembles a fieldwork questionnaire, containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features such as number, person, tense, and gender. The version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books with parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section 2.2.2.5).

2.3.2.1 Morphology and Translation Lexicons

In order to build translation and morphology lexicons, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between the books; only 3,002 word types occurred in more than one book.[1] Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

[1] This was done before the OCR correction was completed, and thus this list contained OCR errors.

These 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10.

1. Word Segmentation
2. Root Translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the Final Root

Figure 10: Fields required to automatically build the translation and morphology lexicons.

The native speaker was asked to provide the Translation of the Final Root (last field in Figure 10) only if there had been a part-of-speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua, this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old", and the word is a verb whose root really means "to get old" ("machuyay").2 Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently in Spanish; for these cases, the native speaker was asked to use '||' instead of '|', and the post-processing scripts were designed to check for the consistency of '||' in both the translation and the POS fields.

2 -ya- is a verbalizer in Quechua.
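The expansion from one annotated row to several lexical entries is essentially a cross product of the alternative translations and POS tags, respecting the '|' and '||' conventions described above; a sketch under that assumption:

def expand(word, translations, poses):
    """Expand one annotated row into (POS, word, translation) entries.
    '|' separates alternatives within one reading; '||' separates readings
    whose POS (and translation) differ."""
    entries = []
    for trans_group, pos_group in zip(translations.split("||"),
                                      poses.split("||")):
        for pos in [p.strip() for p in pos_group.split("|")]:
            for trans in [t.strip() for t in trans_group.split("|")]:
                entries.append((pos, word, trans))
    return entries

# The "chayqa" row of Figure 11: the six Pron/Adj entries plus the VP reading.
print(expand("chayqa", "ese | esa | eso || es ese", "Pron | Adj || VP"))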


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word:             chayqa
Segmentation:     chay+qa
Root translation: ese | esa | eso
Root POS:         Pron | Adj
Word Translation: ese || es ese
Word POS:         Pron || VP

Figure 11: Example of a word segmented and translated.

V | [ni] -> [decir] ((X1::Y1))
N | [pacha] -> [tiempo] ((X1::Y1))
N | [pacha] -> [tierra] ((X1::Y1))
Pron | [noqa] -> [yo] ((X1::Y1))
Interj | [alli] -> [a pesar] ((X1::Y1))
Adj | [hatun] -> [grande] ((X1::Y1))
Adj | [hatun] -> [alto] ((X1::Y1))
Adv | [kunan] -> [ahora] ((X1::Y1))
Adv | [allin] -> [bien] ((X1::Y1))
Adv | [ama] -> [no] ((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list.

;; "dicen que" on the Spanish side
Suff::Suff | [s] -> []
((X1::Y1) ((x0 type) = reportative))

;; when following a consonant
Suff::Suff | [si] -> []
((X1::Y1) ((x0 type) = reportative))

Suff::Suff | [qa] -> []
((X1::Y1) ((x0 type) = emph))

Suff::Suff | [chu] -> []
((X1::Y1) ((x0 type) = interr))

VSuff::VSuff | [nki] -> []
((X1::Y1) ((x0 person) = 2) ((x0 number) = sg) ((x0 mood) = ind) ((x0 tense) = pres) ((x0 inflected) = +))

NSuff::NSuff | [kuna] -> []
((X1::Y1) ((x0 number) = pl))

NSuff::Prep | [manta] -> [de]
((X1::Y1) ((x0 form) = manta))

Figure 13: Manually written suffix lexical entries.

For the current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuaman's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in subsection 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in Section 2.2.2.5.

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and corrected nine machine translations, when necessary. The correction of MT output is the first step towards automatic rule refinement, and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.


{S,2}
S::S : [NP VP] -> [NP VP]
((X1::Y1)
 (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ;; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ;; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender)))

{SBar,1}
SBar::SBar : [S] -> [Dice que S]
((X1::Y2)
 ((x1 type) =c reportative))

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
((X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood)))

Figure 14: Manually written grammar rules for Quechua-Spanish translation.

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO))))))>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ)))))))>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO)))) (VP3 (VBAR2 (V35 SOY)) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA)))))))>

Figure 15: MT output from the Quechua-Spanish rule-based MT prototype.



2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants, and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.
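Root-and-pattern word formation can be illustrated with a toy interdigitation function (the root, patterns, and transliterated outputs below are only illustrative):

def interdigitate(root, pattern):
    """Insert the root's consonants into the 'C' slots of a pattern."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

# e.g. the triconsonantal root g-d-l with two different patterns:
print(interdigitate("gdl", "CaCaC"))   # gadal
print(interdigitate("gdl", "CiCeC"))   # gidel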

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,3 not unlike that for Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"), prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"), subordinating conjunctions such as $ ("that") and K$ ("when"), relativizers such as $ ("that"), and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus a form such as MHGR can be read as a lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or a prefixed particle.

3 To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule based, relying on the AVENUE run-time transfer engine described in section 2.2.2.3.2 to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English; this small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing publicly available resources, which we adapted in novel ways for the MT task, to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source-language morphology analysis and target-language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select the combination of sequential translation segments that together represents the most likely translation of the entire input sentence.


We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules; the same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, and tense. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the Hebrew analyzer is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure, which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which will all be stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice Representation of a set of Analyses for the Hebrew Word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-Structure Representation of a set of Analyses for the Hebrew Word B$WRH
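A minimal sketch of how such a lattice can be assembled from the analyzer output, assuming the analyses are already expressed as span-annotated feature structures as in Figure 3; the data layout and function names here are illustrative, not the system's actual code:

# Two of the four analyses of B$WRH from Figure 3, for illustration.
analyses = [
    [{'SPANSTART': 0, 'SPANEND': 4, 'LEX': 'B$WRH', 'POS': 'N',
      'GEN': 'F', 'NUM': 'S', 'STATUS': 'ABSOLUTE'}],
    [{'SPANSTART': 0, 'SPANEND': 1, 'LEX': 'B', 'POS': 'PREP'},
     {'SPANSTART': 1, 'SPANEND': 4, 'LEX': '$WRH', 'POS': 'N',
      'GEN': 'F', 'NUM': 'S', 'STATUS': 'ABSOLUTE'}],
]

def build_lattice(word, analyses, span_end):
    """Collect the lexeme arcs of all analyses, storing identical arcs
    (e.g. the preposition B shared by several analyses) only once, and
    add the full word form as an unknown-word arc with no features."""
    arcs = {}
    for lexeme_sequence in analyses:
        for lexeme in lexeme_sequence:
            arcs[tuple(sorted(lexeme.items()))] = lexeme
    unknown = {'SPANSTART': 0, 'SPANEND': span_end, 'LEX': word, 'POS': 'LEX'}
    arcs[tuple(sorted(unknown.items()))] = unknown
    return list(arcs.values())

lattice = build_lattice('B$WRH', analyses, span_end=4)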

While the analyzer of Segal (Segal 1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan 1997), whose main benefit is that we were able to obtain it in a machine readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.
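For illustration, a converted entry takes roughly the following shape in the transfer-rule formalism used throughout this paper; the specific entry below is constructed for exposition rather than quoted from the system's lexicon:

N::N |: ["$WR"] -> ["bull"]
((X1::Y1)
((X0 GEN) = M)
((X0 NUM) = S))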

2.8 English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At runtime, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
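The runtime selection amounts to a simple unification check, sketched below in Python under the assumption of flat feature sets; the Hebrew entry SPR ("book") and the feature names are hypothetical:

def satisfies(analysis, constraints):
    """A lexical entry survives if its constraints do not contradict the
    features returned by the Hebrew analyzer."""
    return all(analysis.get(feat) == val for feat, val in constraints.items())

# Hypothetical enhanced entries for one Hebrew noun, one per English form:
entries = [
    {'hebrew': 'SPR', 'english': 'book',  'constraints': {'NUM': 'S'}},
    {'hebrew': 'SPR', 'english': 'books', 'constraints': {'NUM': 'P'}},
]

analysis = {'LEX': 'SPR', 'POS': 'N', 'NUM': 'P'}   # analyzer output (plural)
chosen = [e['english'] for e in entries
          if e['hebrew'] == analysis['LEX'] and satisfies(analysis, e['constraints'])]
# chosen == ['books']: only the plural entry passes unification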

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
SL: $MLH ADWMH
TL: A RED DRESS
Score: 2
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1))

{NP1,3}
SL: H $MLWT H ADWMWT
TL: THE RED DRESSES
Score: 4
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1))

Figure 4: NP Transfer Rules for Nouns Modified by Adjectives from Hebrew to English

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al. 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team, and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages, and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5: Select Translated Sentences from the Development Data


System            BLEU                      NIST                      Precision   Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830      0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938      0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085      0.4241

Table 1: System Performance Results with the Various Grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al. 2002) and NIST (Doddington 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual grammar system outperforms the no-grammar system according to all the metrics. The learned grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.


2.11 Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language for the Surprise Language Exercise, a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years, and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated, and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches, developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003.

As data resources became available, it became clear that the Hindi-to-English task was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE (for other approaches to MT with limited resources see [Nirenburg 1998; Sheremetyeva and Nirenburg 2000; Jones and Havrilla 1998]). We therefore designed an "artificial", extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1. Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e. English) into their native language (i.e. Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL):

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP [V V Aux] -> [Aux "being" V]
((X1::Y3)
((x1 form) = part)
((x1 aspect) = perf)
((x2 form) = part)
((x2 aspect) = imperf)
((x2 lexwx) = jAnA)
((x3 lexwx) = honA)
((x3 tense) = pres)
((x0 tense) = (x3 tense))
(x0 = x1)
((y1 lex) = be)
((y1 tense) = pres)
((y3 form) = part))

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat", in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as Det N P and Det Adj N P can be re-written more generally as NP P, as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
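As a highly simplified sketch, Seed Generation can be thought of as reading a flat rule directly off one POS-tagged, word-aligned example. The example phrase and helper below are hypothetical, and the real learner described in [Probst et al. 2003] handles many complications omitted here:

def seed_rule(rule_type, sl_pos, tl_pos, alignments):
    """Produce a flat transfer rule from one POS-tagged, word-aligned
    example; alignments are 1-based (SL index, TL index) pairs."""
    head = '%s::%s [%s] -> [%s]' % (rule_type, rule_type,
                                    ' '.join(sl_pos), ' '.join(tl_pos))
    body = '(' + ' '.join('(X%d::Y%d)' % a for a in alignments) + ')'
    return head + '\n' + body

# A hypothetical Hindi ADJ-N phrase aligned one-to-one with its English gloss:
print(seed_rule('NP', ['ADJ', 'N'], ['ADJ', 'N'], [(1, 1), (2, 2)]))
# NP::NP [ADJ N] -> [ADJ N]
# ((X1::Y1) (X2::Y2))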

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].
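The core of the transfer step for a single chart constituent can be sketched as follows; the representation of rules and child translations is an illustrative simplification of the actual system:

def transfer(tl_sequence, child_translations):
    """tl_sequence mixes 1-based indices of SL children (realized through
    their own transferred TL strings) with TL literal words."""
    out = []
    for item in tl_sequence:
        if isinstance(item, int):
            out.append(child_translations[item - 1])  # aligned SL child
        else:
            out.append(item)                          # TL literal
    return ' '.join(out)

# The Hebrew NP rule of Figure 4, [NP1 ADJ] -> [ADJ NP1]:
# TL position 1 is SL child 2 (the ADJ), TL position 2 is SL child 1 (the NP).
print(transfer([2, 1], ['dress', 'red']))   # -> 'red dress'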

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_j \sum_i p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_s p(\tilde{f} \mid \tilde{e}) = \prod_s \prod_j \sum_i p(f_{s_j} \mid e_{s_i})    (4)

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence, i.e. leaving a gap, translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution, i.e. longer movements are less likely than shorter ones, with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search, i.e. partial translation hypotheses which are low scoring compared to the best scoring hypothesis up to that point are pruned from further consideration.
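The scoring that Equations (1)-(4) define can be sketched as follows, assuming toy dictionaries for the trigram model and the IBM-1 lexicon; this illustrates the math, and is not the decoder's actual implementation:

import math

def lm_logprob(e_words, trigram):
    """Equation (2): log p(e) under a trigram model; `trigram` maps
    (e_{i-2}, e_{i-1}, e_i) to a probability, unseen triples get a floor."""
    padded = ['<s>', '<s>'] + e_words
    return sum(math.log(trigram.get(tuple(padded[i - 2:i + 1]), 1e-7))
               for i in range(2, len(padded)))

def unit_logprob(f_phrase, e_phrase, t):
    """Equation (3): log p(f~|e~) = sum_j log sum_i t(f_j | e_i)."""
    return sum(math.log(sum(t.get((fj, ei), 1e-7) for ei in e_phrase))
               for fj in f_phrase)

def path_logprob(path, trigram, t):
    """Equations (1) and (4) for one lattice path, a sequence of
    non-overlapping (f_phrase, e_phrase) translation units."""
    e_words = [w for _, e_phrase in path for w in e_phrase]
    return (lm_logprob(e_words, trigram) +
            sum(unit_logprob(f, e, t) for f, e in path))

The decoder would compute this score for every admissible path through the lattice (with beam pruning and the reordering penalty described above) and output the highest-scoring one.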

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to insure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher          Xfer
past participle      (tam = yA)       (aspect = perf) (form = part)
present participle   (tam = wA)       (aspect = imperf) (form = part)
infinitive           (tam = nA)       (form = inf)
future               (tam = future)   (tense = fut)
subjunctive          (tam = subj)     (tense = subj)
root                 (tam = 0)        (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion:
(jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion:
(one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
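The pruning step can be sketched as follows, with an illustrative threshold and rule counts (the actual threshold and rule frequencies were not as shown here):

from collections import Counter

def prune(rule_instances, threshold):
    """Weight each rule by the fraction of training examples that produced
    it, and drop rules whose probability falls below the threshold."""
    counts = Counter(rule_instances)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items() if n / total >= threshold}

rules = (['NP::NP [ADJ N] -> [ADJ N]'] * 40 +
         ['PP::PP [NP POSTP] -> [PREP NP]'] * 55 +
         ['PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]'] * 5)
print(prune(rules, threshold=0.10))   # the rule with 5 supporting examples is dropped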

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word, on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary, and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense entries. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add "es" instead of a simple "s" for plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.
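A small, illustrative subset of such spelling rules, sketched in Python (the actual rule set used was more extensive):

import re

def pluralize(noun):
    if re.search(r'(s|x|z|ch|sh)$', noun):
        return noun + 'es'                 # add 'es' after sibilants
    if re.search(r'[^aeiou]y$', noun):
        return noun[:-1] + 'ies'           # consonant + y -> ies
    return noun + 's'

def gerund(verb):
    if re.search(r'[^aeiou]e$', verb):
        return verb[:-1] + 'ing'           # drop silent e
    if re.search(r'[aeiou][^aeiouwxy]$', verb):
        return verb + verb[-1] + 'ing'     # reduplicate final consonant
    return verb + 'ing'

print(pluralize('box'), pluralize('lady'), gerund('run'), gerund('make'))
# boxes ladies running making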

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans, and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module, to obtain the root forms along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
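The three passes can be summarized in the following sketch, where the four callables stand in for the components described in this section; this is a structural outline, not the system's actual code:

def populate_lattice(sentence, match_phrases, analyze, apply_grammar, match_words):
    """sentence: list of fully inflected Hindi tokens."""
    lattice = []
    lattice += match_phrases(sentence)       # pass 1: phrasal entries, no grammar
    roots = [analyze(w) for w in sentence]   # pass 2: morphology, then ...
    lattice += apply_grammar(roots)          # ... grammar rules over root forms
    lattice += match_words(sentence)         # pass 3: word-to-word, no grammar
    return lattice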

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group, as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                 BLEU                NIST
EBMT                   0.058               4.22
SMT                    0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar        0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar   0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar    0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer + SMT)      0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

Fig. 9. Results by NIST score (NIST score vs. decoder reordering window, for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT)

Fig. 10. Results by BLEU score (BLEU score vs. decoder reordering window, for the same five systems)

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string, according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology);
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3;
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, once, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
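A sketch of this bootstrap procedure, with score_fn standing in for BLEU or NIST computed over a resampled set of (hypothesis, references) pairs:

import random
import statistics

def bootstrap_ci(test_sentences, score_fn, draws=10000, seed=0):
    """Resample the test set with replacement, rescore each sample, and
    return the median score with a 95% confidence interval."""
    rng = random.Random(seed)
    scores = sorted(score_fn([rng.choice(test_sentences)
                              for _ in test_sentences])
                    for _ in range(draws))
    return (statistics.median(scores),
            scores[int(0.025 * draws)],    # lower 95% bound
            scores[int(0.975 * draws)])    # upper 95% bound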

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over" point between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor; a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and VPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html

The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi

Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm

The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html

Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.

Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.

Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.

Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.

Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).

Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.

Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).

Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.

Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.

Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.

Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).

Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.

Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.

Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).

Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).

Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.

Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the Hindi, Hebrew, Mapudungun, and Quechua systems contrasted in this paper, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5, no. 1, January-March 2000.

Cusihuaman, Antonio (2001). Gramatica Quechua Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pages 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef and Carolina Huenchullan (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


translation segments. We use an English trigram language model trained on 160 million words.

The decoding search algorithm considers all possible sequences in the lattice and calculates the language model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model.
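To make this last point concrete, here is a deliberately simplified sketch (Python; a toy greedy approximation under assumed data structures, not the actual lattice decoder) of window-limited reordering of translation units driven by a trigram language model score:

def lm_score(words, trigram_lm):
    # trigram_lm maps (w1, w2, w3) tuples to log-probabilities; a real LM
    # would add smoothing and backoff instead of a flat unknown penalty.
    padded = ["<s>", "<s>"] + words + ["</s>"]
    return sum(trigram_lm.get(tuple(padded[i:i + 3]), -10.0)
               for i in range(len(padded) - 2))

def reorder_units(units, trigram_lm, window=4):
    # Greedy, window-limited reordering: at each output position, pull in
    # whichever of the next `window` unplaced units (lists of words) best
    # extends the partial output under the language model.
    out, remaining = [], list(units)
    while remaining:
        best_i = max(range(min(window, len(remaining))),
                     key=lambda i: lm_score(out + remaining[i], trigram_lm))
        out.extend(remaining.pop(best_i))
    return out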

2.2.2.4 Mapudungun Rule-Based System

With a basic understanding of the syntactic transfer engine and rule semantics in hand, we now return to a discussion of the Mapudungun-Spanish manually written rule-based MT system. While a hand-written system for Mapudungun naturally draws on the human expertise of the AVENUE Mapudungun team, we did not neglect the available computational resources for Mapudungun either. We developed the Mapudungun-Spanish lexicon for the rule-based system from the collected spoken corpus. Similarly, a version of the elicitation corpus translated into Mapudungun served as a development suite to guide us on which syntactic transfer rules were most vital to write.

The next three sub-sections discuss specific challenges facing Mapudungun-Spanish translation. Any human or machine translator must explicitly or implicitly address these three challenges (among others) when translating between this pair of languages. Manually writing transfer rules forced us to consciously overcome these challenges. The first challenge discussed is the transfer of agglutinative sequences of Mapudungun suffixes into grammatical Spanish expressions. The second challenge is to specify Spanish tense from Mapudungun, which is largely unmarked for tense. And the third and final challenge is the handling of a class of grammatical structures expressed through morphology in Mapudungun but by syntax in Spanish. These three challenges merely give a flavor for the kind of transfer rules that are needed to write a transfer grammar from Mapudungun to Spanish, two completely unrelated languages. For a more in-depth discussion of the transfer grammar and translation mismatches between Mapudungun and Spanish, please see ().

2.2.2.4.1 Suffix Agglutination

As discussed above, Mapudungun is an agglutinative language, with often four or more inflectional suffixes occurring on a single stem. The manually written transfer grammar manages suffix agglutination in Mapudungun by constructing constituents called Verbal Suffix Groups (VSuffG). The rules that build VSuffG constituents operate recursively. The transfer rule labeled VSuffG 1 in Figure 6 parses a Verbal Suffix (VSuff) into a VSuffG, copying the entire set of features belonging to the suffix into the new constituent. A second VSuffG rule, labeled VSuffG 2 in Figure 6, combines an additional VSuff recursively with a VSuffG, passing up the feature structures of both children to the parent node. Notice that at the VSuffG level there is no transfer of features to the target language, Spanish, and there are no alignments specified between source and target constituents. Because Mapudungun (morphological) syntax differs so radically from Spanish, the VSuffG rules decouple the parsing of Mapudungun from the transfer to, and generation of, Spanish expressions.

These two VSuffG rules successfully apply to combine the two verbal suffixes -fi and -ñ of example (1) into a single VSuffG. First, the rule VSuffG 1 is applied to -fi to produce a VSuffG marked with third person object agreement. Next, the rule VSuffG 2 recursively combines -ñ with the VSuffG containing -fi into a VSuffG marked both for object agreement and for mood and subject agreement. The result is a Verb Suffix Group that has all the grammatical features of its components. If additional Mapudungun suffixes occurred beyond -fi and -ñ, suffix parsing would continue recursively.

(1) pe -fi -ñ
    see -3.obj -1.sg.subj.indic
    'I saw him/her/them/it'

{VSuffG,1}
VSuffG::VSuffG : [VSuff] -> []
((X0 = X1))

{VSuffG,2}
VSuffG::VSuffG : [VSuffG VSuff] -> []
((X0 = X1)
 (X0 = X2))

Figure 6: Two recursive Verbal Suffix Group rules
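The recursive feature percolation performed by these two rules can be pictured with a small sketch (Python; the dictionary-based feature structures and feature names are ours, purely illustrative of the mechanism, not the grammar's own notation):

def parse_vsuffg(suffixes):
    # Mirrors rules VSuffG 1 and VSuffG 2: the first suffix seeds the
    # VSuffG (X0 = X1), and each further suffix is merged in recursively,
    # with all features passed up to the parent (X0 = X1, X0 = X2).
    vsuffg = dict(suffixes[0])
    for suff in suffixes[1:]:
        vsuffg.update(suff)
    return vsuffg

# pe-fi-ñ: -fi marks a 3rd person object; -ñ marks a 1sg subject, indicative.
fi = {"obj.person": 3}
n = {"subj.person": 1, "subj.number": "sg", "mood": "indic"}
print(parse_vsuffg([fi, n]))  # a VSuffG with the features of both suffixes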



2.2.2.4.2 Tense

Tense in Mapudungun is often morphologically unmarked. The temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun. Consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.

2.2.2.4.3 Typological Divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are grammaticalized using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation. But the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see example (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as well as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle; see example (3).

(2) Mapudungun:  kofke-tu -n
                 bread-verbalizer -1.sg.subj.indic
    Spanish:     com-o pan
                 eat -1.sg.pres.indic bread
    English:     'I eat bread'

(3) Mapudungun:  pe -nge -n
                 see -passive -1.sg.subj.indic
    Spanish:     fui visto
                 Aux.1.sg.past.indic see.passPart
    English:     'I was seen'

In a syntactic transfer system, both the Spanish light verb construction as well as the Spanish passive require the introduction of an additional verb on the Spanish side. The rule for passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary

Grammatical Features   Lexical Aspect   Temporal          Example
(Tense, Aspect)                         Interpretation

Future, Any            Any              Future            pe -a -n (see -future -1.sg.subj.indic), 'I will see'
Unmarked, Any          Stative          Present           niye -n (own -1.sg.subj.indic), 'I own'
Unmarked, Habitual     Unmarked         Present           kellu -ke -n (help -habitual -1.sg.subj.indic), 'I help'
Unmarked, Unmarked     Unmarked         Past              kellu -n (help -1.sg.subj.indic), 'I helped'

Figure 7: Abstract rules for the interpretation of Tense in Mapudungun

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
((X1::Y1)                              ; alignment
 ((X2 tense) = UNDEFINED)              ; SL constraint on morphological tense
 ((X1 lexicalaspect) = UNDEFINED)      ; SL constraint on verb's aspectual class
 ((X2 aspect) = (NOT habitual))        ; SL constraint on grammatical aspect
 ((X0 tense) = past)                   ; SL tense feature assignment
 ...)

Figure 8: Past tense rule (transfer of features omitted)
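Viewed procedurally, the tense-interpretation rules of Figure 7, which VBar rules like the one in Figure 8 encode as feature constraints, reduce to a small decision function (a Python sketch; the feature values are ours, not the grammar's):

def temporal_interpretation(tense, aspect, lexical_aspect):
    # Figure 7, row by row.
    if tense == "future":
        return "future"       # pe-a-n: 'I will see'
    if lexical_aspect == "stative":
        return "present"      # niye-n: 'I own'
    if aspect == "habitual":
        return "present"      # kellu-ke-n: 'I help'
    return "past"             # kellu-n: 'I helped'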


the correct grammatical features, and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

2.2.2.5 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona, under a research license. Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that inflected form marks. We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute and value pairs of the format that our MT system expects. This way, as the run-time transfer engine needs a particular inflected form of a Spanish word, the transfer engine can pass the uninflected form to the Spanish Morphology Generator and have it generate the appropriate surface inflected form.
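In outline, the generation module is a lookup keyed on a citation form plus a feature set. A minimal sketch (Python; the entry format and feature names are assumptions for illustration, not UPC's actual encoding):

def build_generator(entries):
    # entries: (citation_form, inflected_form, {feature: value}) triples
    # read from the inflected dictionary, after mapping UPC's feature
    # codes into our attribute-value format.
    index = {}
    for citation, inflected, feats in entries:
        index[(citation, frozenset(feats.items()))] = inflected
    return index

def generate(index, citation, feats):
    # Called at run time by the transfer engine with an uninflected stem
    # and the grammatical features assigned by the rules; falls back to
    # the stem itself if no inflected form is recorded.
    return index.get((citation, frozenset(feats.items())), citation)

gen = build_generator([
    ("ayudar", "ayudado", {"mood": "part", "gender": "m"}),
    ("ayudar", "ayudada", {"mood": "part", "gender": "f"}),
])
print(generate(gen, "ayudar", {"mood": "part", "gender": "f"}))  # ayudada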

When the Spanish morphological generation is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN) ) ) ) ) )>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA) ) ) ) ) ) )>

tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO) ) ) ) ) ) )>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA) ) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN) ) ) ) ) )>

2.2.3 Evaluation

Things that need to go in here:
- Hand evaluation
- BLEU, NIST, and METEOR scores
- Whether we can evaluate the EBMT system as well on the same test data
- Statistical significance tests of the results
- Automatic evaluation of a word-to-word translator

2.3 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established a collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals for building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing

{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]       ; insertion of aux on the Spanish side
((X1::Y2)                              ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
 ((X2 voice) =c passive)               ; Mapudungun voice constraint
 ((Y1 person) = (Y0 person))           ; passing person features to aux in Spanish
 ((Y1 number) = (Y0 number))           ; passing number features to aux in Spanish
 ((Y1 mood) = (Y0 mood))               ; passing mood features to aux in Spanish
 ((Y2 number) =c (Y1 number))          ; Spanish-side agreement constraint
 ((Y1 tense) = past)                   ; assigning tense feature to aux in Spanish
 ((Y1 form) =c ser)                    ; Spanish auxiliary selection
 ((Y2 mood) = part)                    ; Spanish-side verb form constraint
 ...)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, the non-Indo-European language Mapudungun marks passive voice with a verbal suffix. Hence, an auxiliary verb must be inserted on the Spanish side; the auxiliary verb is the first V constituent on the production side of the transfer rule. This rule then ensures that the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.


information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker, and were later manually revised. Taking the Elicitation Corpus (section N) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough to use for development of MT. The EC resembles a fieldwork questionnaire, containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features, such as number, person, tense, and gender. The version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books which had parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (section 2.2.2.5).

2.3.2.1 Morphology and Translation Lexicons

In order to build translation and morphology lexicons, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books. Only 3,002 word types were in more than one book.1 Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique

1 This was done before the OCR correction was completed, and thus this list contained OCR errors.

word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.
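The frequency-based selection itself is a simple step; a sketch (Python, with assumed plain-text inputs and whitespace tokenization):

from collections import Counter

def top_word_types(book_texts, n=10000):
    # Count word types across the scanned books and keep the n most
    # frequent for manual segmentation and translation.
    counts = Counter(tok for text in book_texts for tok in text.split())
    return [word for word, _ in counts.most_common(n)]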

10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10:

1. Word Segmentation
2. Root translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the final root

Figure 10: Fields required to automatically build the translation and morphology lexicons

The native speaker was asked to provide the Translation of the final root (last field in Figure 10) only if there had been a Part-of-Speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua, this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old" and the word is a verb whose root really means "to get old" ("machuyay").2 Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently in Spanish. For these cases, the native speaker was asked to use '||' instead of '|', and the post-processing scripts were designed to check for the consistency of '||' in both the translation and the POS fields.

2 -ya- is a verbalizer in Quechua.


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13. For the

Word:             chayqa
Segmentation:     chay+qa
Root translation: ese | esa | eso
Root POS:         Pron | Adj
Word Translation: ese || es ese
Word POS:         Pron || VP

Figure 11: Example of a word segmented and translated
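A sketch of the expansion step that turns one such annotated row into stem lexical entries (Python; the '||' case, where translations differ by POS, is omitted here for brevity):

def entries_from_row(root, root_translations, root_pos):
    # '|' separates alternative translations and POS tags that combine
    # freely; one lexical entry is generated per (POS, translation) pair.
    translations = [t.strip() for t in root_translations.split("|")]
    pos_tags = [p.strip() for p in root_pos.split("|")]
    return [(pos, root, t) for pos in pos_tags for t in translations]

# The 'chayqa' row of Figure 11 yields six entries:
for pos, src, tgt in entries_from_row("chay", "ese | esa | eso", "Pron | Adj"):
    print(f"{pos} | [{src}] -> [{tgt}]  ((X1::Y1))")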

V | [ni] -> [decir]
((X1::Y1))

N | [pacha] -> [tiempo]
((X1::Y1))

N | [pacha] -> [tierra]
((X1::Y1))

Pron | [noqa] -> [yo]
((X1::Y1))

Interj | [alli] -> [a pesar]
((X1::Y1))

Adj | [hatun] -> [grande]
((X1::Y1))

Adj | [hatun] -> [alto]
((X1::Y1))

Adv | [kunan] -> [ahora]
((X1::Y1))

Adv | [allin] -> [bien]
((X1::Y1))

Adv | [ama] -> [no]
((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list

('dicen que' on the Spanish side)
Suff::Suff | [s] -> []
((X1::Y1)
 ((x0 type) = reportative))

(when following a consonant)
Suff::Suff | [si] -> []
((X1::Y1)
 ((x0 type) = reportative))

Suff::Suff | [qa] -> []
((X1::Y1)
 ((x0 type) = emph))

Suff::Suff | [chu] -> []
((X1::Y1)
 ((x0 type) = interr))

VSuff::VSuff | [nki] -> []
((X1::Y1)
 ((x0 person) = 2)
 ((x0 number) = sg)
 ((x0 mood) = ind)
 ((x0 tense) = pres)
 ((x0 inflected) = +))

NSuff::NSuff | [kuna] -> []
((X1::Y1)
 ((x0 number) = pl))

NSuff::Prep | [manta] -> [de]
((X1::Y1)
 ((x0 form) = manta))

Figure 13: Manually written suffix lexical entries

current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuaman's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in subsection 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in section 2.2.2.5.

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, where three Quechua speakers with good knowledge of Spanish evaluated and corrected nine machine translations, when necessary. The correction of MT output is the first step towards automatic rule refinement, and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.


{S,2}
S::S : [NP VP] -> [NP VP]
((X1::Y1)
 (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender)))

{SBar,1}
SBar::SBar : [S] -> [Dice que S]
((X1::Y2)
 ((x1 type) =c reportative))

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
((X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood)))

Figure 14: Manually written grammar rules for Quechua-Spanish translation

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO) ) ) ) )>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ) ) ) ) ) ) )>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO) ) ) ) (VP3 (VBAR2 (V35 SOY) ) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA) ) ) ) ) ) )>


Figure 15: MT output from the Quechua-Spanish rule-based MT prototype

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,3 not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"), prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"), subordinating conjunctions such as $ ("that") and K$ ("when"), relativizers such as $ ("that"), and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus, a form such as MHGR can be read as a lexeme "immigrant", as M-HGR "from

3 To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.

Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or of a prefixing particle.
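The prefix-particle ambiguity can be made concrete with a small sketch (Python) that proposes candidate segmentations of a romanized form; in the real analyzer, each candidate stem must additionally be validated against the lexicon, which this toy version does not do:

PREFIX_PARTICLES = ["B", "K", "L", "M", "H", "W", "$", "K$"]

def segmentations(form, max_prefixes=2):
    # Read the whole form as one lexeme, or peel off up to max_prefixes
    # leading particles and recurse on the remainder.
    results = [[form]]
    if max_prefixes == 0:
        return results
    for p in PREFIX_PARTICLES:
        if form.startswith(p) and len(form) > len(p):
            for rest in segmentations(form[len(p):], max_prefixes - 1):
                results.append([p] + rest)
    return results

print(segmentations("MHGR"))
# [['MHGR'], ['M', 'HGR'], ['M', 'H', 'GR']]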

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud "dots", decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus, the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule-based, relying on the AVENUE run-time transfer engine described earlier to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets, and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. The small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing publicly available resources, which we adapted in novel ways for the MT task, to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation


segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building, within the AVENUE project, morphological analyzers for those languages. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the Hebrew analyzer is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which will all be stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word, and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2 Lattice Representation of a set of Analyses for the Hebrew Word B$WRH

Y0 ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1 ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2 ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3 ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4 ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5 ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6 ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7 ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3 Feature-Structure Representation of a set of Analyses for the Hebrew Word B$WRH
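
To make the representation concrete, the following is a minimal sketch in Python (an invented representation, not the actual AVENUE data structure) of the lattice in Figures 2 and 3; enumerating all sequences of adjoining arcs that cover the word span reproduces the alternative lexeme segmentations.

arcs = [  # (span_start, span_end, lexeme, features)
    (0, 4, "B$WRH", {"POS": "N", "GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"}),
    (0, 2, "B",     {"POS": "PREP"}),
    (1, 3, "$WR",   {"POS": "N", "GEN": "M", "NUM": "S", "STATUS": "ABSOLUTE"}),
    (3, 4, "$LH",   {"POS": "POSS"}),
    (0, 1, "B",     {"POS": "PREP"}),
    (1, 2, "H",     {"POS": "DET"}),
    (2, 4, "$WRH",  {"POS": "N", "GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"}),
    (0, 4, "B$WRH", {}),  # unknown-word arc, added with no features
]

def lexeme_sequences(arcs, start=0, end=4):
    """Enumerate every sequence of adjoining arcs covering [start, end),
    i.e. the alternative lexeme segmentations stored in the lattice."""
    if start == end:
        yield []
        return
    for arc in arcs:
        if arc[0] == start:
            for rest in lexeme_sequences(arcs, arc[1], end):
                yield [arc] + rest

for seq in lexeme_sequences(arcs):
    print(" ".join(a[2] for a in seq))
# prints: B$WRH / B $WRH / B $WR $LH / B H $WRH / B$WRH (the unknown-word arc)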

While the analyzer of Segal (Segal 1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person
feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) which require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
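
The following is a minimal sketch of the noun half of this enhancement, with invented entry fields and a deliberately naive spelling rule; verb entries are produced analogously.

def pluralize(noun):
    # Simplified English spelling rule; the real process handles more cases.
    return noun + "es" if noun.endswith(("s", "sh", "ch", "x")) else noun + "s"

def enhance_nouns(lexicon):
    """lexicon: entries like {'heb': ..., 'eng': ..., 'pos': 'N',
    'constraints': {}}; returns the lexicon with plural entries added."""
    enhanced = []
    for entry in lexicon:
        enhanced.append(entry)
        if entry["pos"] == "N":
            # Same Hebrew word, plural English form, plus a number constraint
            # so that only a plural Hebrew analysis unifies with this entry.
            enhanced.append(dict(entry, eng=pluralize(entry["eng"]),
                                 constraints={"num": "pl"}))
    return enhanced

print(enhance_nouns([{"heb": "$MLH", "eng": "dress", "pos": "N",
                      "constraints": {}}]))
# -> the original entry for "dress" plus an entry for "dresses" with (num = pl)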

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
SL: $MLH ADWMH
TL: A RED DRESS
Score: 2
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1))

{NP1,3}
SL: H $MLWT H ADWMWT
TL: THE RED DRESSES
Score: 4
NP1::NP1 [NP1 H ADJ] -> [ADJ NP1]
((X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1))

Figure 4 NP Transfer Rules for Nouns Modified by Adjectives from Hebrew to English

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al. 2003) and applied them to a word-aligned Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team, and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5 Select Translated Sentences from the Development Data

System             BLEU                     NIST                      Precision  Recall
No Grammar         0.0606 [0.0599,0.0612]   3.4176 [3.4080,3.4272]    0.3830     0.4153
Learned Grammar    0.0775 [0.0768,0.0782]   3.5397 [3.5296,3.5498]    0.3938     0.4219
Manual Grammar     0.1013 [0.1004,0.1021]   3.7850 [3.7733,3.7966]    0.4085     0.4241

Table 1 System Performance Results with the Various Grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al. 2002) and NIST (Doddington 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.

2.11 Hindi

Little natural language processing work had been done on Hindi by 2003, when DARPA designated Hindi as the surprise language for a task aimed at rapidly building MT systems for languages with few resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,
and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated, and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE (see, for instance, [Nirenburg 1998; Sheremetyeva and Nirenburg 2000; Jones and Havrilla 1998]). We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig 1 Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number, agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.

The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP [V V Aux] -> [Aux being V]
((X1::Y3)
((x1 form) = part)
((x1 aspect) = perf)
((x2 form) = part)
((x2 aspect) = imperf)
((x2 lexwx) = jAnA)
((x3 lexwx) = honA)
((x3 tense) = pres)
((x0 tense) = (x3 tense))
(x0 = x1)
((y1 lex) = be)
((y1 tense) = pres)
((y3 form) = part))

Fig 3 A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat" in that they specify a sequence of parts of speech, and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be rewritten more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free; they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
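
A toy illustration of the compositionality step (the names and representations here are invented; the real learner operates over full transfer rules with alignments): wherever the right-hand side of a previously learned NP rule occurs inside a flat rule, the matched POS subsequence is replaced by the NP label.

def compose(flat_rhs, learned):
    """flat_rhs: POS sequence of a flat rule; learned: maps the RHS tuple of
    a previously learned rule to its constituent label."""
    for sub, label in learned.items():
        n = len(sub)
        for i in range(len(flat_rhs) - n + 1):
            if tuple(flat_rhs[i:i + n]) == sub:
                return flat_rhs[:i] + [label] + flat_rhs[i + n:]
    return flat_rhs

learned_nps = {("Det", "Adj", "N"): "NP", ("Det", "N"): "NP"}
print(compose(["Det", "Adj", "N", "P"], learned_nps))  # -> ['NP', 'P']
print(compose(["Det", "N", "P"], learned_nps))         # -> ['NP', 'P']
# Both flat PP rules generalize to [NP P], as described in the text.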

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].
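
The following schematic sketch, under heavily simplified, invented representations (the real engine works over charts of feature structures and preserves ambiguity throughout), shows how the y-side ordering and alignments of the rule that built each constituent drive the integrated transfer and generation step.

def transfer(node, lexicon):
    """node is ('word', sl_word) or ('phrase', rule, children); ambiguity
    (multiple rules or translations per node) is ignored for brevity."""
    if node[0] == "word":
        return lexicon[node[1]]              # word-level lexical seeding
    _, rule, children = node
    slots = list(rule["y_side"])             # TL literals would stay in place
    for x_pos, y_pos in rule["align"].items():
        slots[y_pos - 1] = transfer(children[x_pos - 1], lexicon)
    return " ".join(slots)

# A Hindi postposition rule of the kind shown later in Figure 5:
# [NP Postp] -> [Prep NP], with alignments X1::Y2 and X2::Y1.
pp_rule = {"y_side": ["", ""], "align": {1: 2, 2: 1}}
tree = ("phrase", pp_rule, [("word", "jIvana"), ("word", "ke")])
print(transfer(tree, {"jIvana": "life", "ke": "of"}))  # -> "of life"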

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for:

\hat{e} = \arg\max_e p(e|f) = \arg\max_e p(f|e) \, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i | e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} | \tilde{e}) = \prod_j \sum_i p(f_j | e_i)    (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f|e) = \prod_s p(\tilde{f} | \tilde{e}) = \prod_s \prod_j \sum_i p(f_{s_j} | e_{s_i})    (4)

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence, i.e., leaving a gap, translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution, i.e., longer movements are less likely than shorter ones, with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low-scoring compared to the best-scoring hypothesis up to that point are pruned from further consideration.
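
As a toy illustration of equations (1)-(4), the sketch below scores one candidate sequence of translation units from the lattice; the trigram model lm(w, u, v) = p(w|u,v) and the IBM-1 lexicon lex[f][e] = p(f|e) are assumed given, and the smoothing floor is invented.

import math

def path_log_score(path, lm, lex):
    """path: list of (src_words, tgt_words) translation units covering
    adjoining, non-overlapping spans of the source sentence."""
    e_words = [w for _, tgt in path for w in tgt]
    # Language model p(e): standard trigram over the target string (eq. 2).
    log_p, hist = 0.0, ("<s>", "<s>")
    for w in e_words + ["</s>"]:
        log_p += math.log(lm(w, *hist))
        hist = (hist[1], w)
    # Translation model p(f|e): per unit, a product over source words of the
    # summed lexicon probabilities over target words (eqs. 3 and 4).
    for src, tgt in path:
        for f in src:
            log_p += math.log(sum(lex.get(f, {}).get(e, 1e-10) for e in tgt))
    return log_p  # the decoder keeps the highest-scoring covering sequence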

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to insure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description           Morpher          Xfer
past participle       (tam = yA)       (aspect = perf) (form = part)
present participle    (tam = wA)       (aspect = imperf) (form = part)
infinitive            (tam = nA)       (form = inf)
future                (tam = future)   (tense = fut)
subjunctive           (tam = subj)     (tense = subj)
root                  (tam = 0)        (form = root)

Table I Tense, Aspect and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect and mood, and the corresponding features that we mapped them into.
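
The mapping in Table I amounts to a small lookup table; a sketch (with an invented helper name for our transformation of the Morpher output) might look as follows.

TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},  # past participle
    "wA":     {"aspect": "imperf", "form": "part"},  # present participle
    "nA":     {"form": "inf"},                       # infinitive
    "future": {"tense": "fut"},                      # future
    "subj":   {"tense": "subj"},                     # subjunctive
    "0":      {"form": "root"},                      # root
}

def convert_morpher_output(features):
    """E.g. {'tam': 'yA', 'gen': 'm'} -> {'aspect': 'perf', 'form': 'part',
    'gen': 'm'}: the tam feature is rewritten, all others pass through."""
    out = dict(features)
    tam = out.pop("tam", None)
    if tam is not None:
        out.update(TAM_TO_XFER[tam])
    return out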

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2)
(X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2)
(X2::Y1))

Fig 5 Recursive NP Rules

Hindi NP with left recursion: (jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion: (one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig 6 Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)
(X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig 7 Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
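
A minimal sketch of this frequency-based pruning follows; the probability of a rule is its relative frequency over training examples, and the threshold value here is invented (the actual cutoff is not stated).

from collections import Counter

def prune_rules(rule_per_example, threshold=0.01):
    """rule_per_example: the rule (as a hashable object) produced by each
    training example; returns the surviving rules with their probabilities."""
    counts = Counter(rule_per_example)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()
            if n / total >= threshold}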

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary, and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense forms. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed (a schematic sketch of this filtering appears after the list of manual rules below).

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add "es" instead of a simple "s" for the plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert, in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
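
As a side note, the unification-based filtering of candidate lexical rules described above can be sketched in a few lines; the entries and features here are invented, but the selection logic follows the plural-noun example: a candidate survives only if none of its constraints contradicts the features returned by the Morpher.

def unifies(analyzed, constraints):
    # A missing feature is compatible; a conflicting value is not.
    return all(analyzed.get(k, v) == v for k, v in constraints.items())

candidates = [            # both entries share the same Hindi root form
    {"eng": "letter",  "constraints": {"num": "sg"}},
    {"eng": "letters", "constraints": {"num": "pl"}},
]
analyzed = {"num": "pl", "gen": "m"}  # Morpher output for a plural noun
print([c["eng"] for c in candidates if unifies(analyzed, c["constraints"])])
# -> ['letters']: only the plural entry passes unification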

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the words of the Hindi sentence against the lexicon in their full, inflected form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module, to obtain the root forms along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations, without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
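
Schematically, the three passes can be sketched as follows; the helper functions are hypothetical stand-ins for the engine's actual components.

def build_lattice(sentence, lexicon, grammar, morpher):
    lattice = []
    # Pass 1: match full inflected forms against the lexicon so that phrasal
    # entries (more than one Hindi word) can fire; no grammar rules applied.
    lattice += match_phrasal_entries(sentence, lexicon)
    # Pass 2: run each word through the morphology module and feed the root
    # forms, with their inflectional features, into the grammar rules; this
    # is where the enhanced lexical entries of Section 4.2 pay off.
    analyzed = [morpher(word) for word in sentence]
    lattice += apply_grammar_rules(analyzed, grammar, lexicon)
    # Pass 3: match the original inflected words word-to-word against the
    # dictionary, without grammar rules, as a fallback.
    lattice += match_word_to_word(sentence, lexicon)
    return lattice  # a bigger lattice is better: the decoder rescores it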

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig 8 Example of a Test Sentence and Reference Translations

System                  BLEU                NIST
EBMT                    0.058               4.22
SMT                     0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer + SMT)       0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II System Performance Results for the Various Translation Approaches

Fig 9 Results by NIST score (NIST score as a function of the decoder reordering window, 0-4, for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT)

Fig 10 Results by BLEU score (BLEU score as a function of the decoder reordering window, 0-4, for the same five systems)

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology)
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
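
A short sketch of this resampling procedure follows; score_fn (standing in for BLEU or NIST computed over a list of test sentences) is assumed given.

import random
import statistics

def bootstrap_ci(test_sentences, score_fn, n_draws=10000, alpha=0.05):
    scores = []
    for _ in range(n_draws):
        # Draw 258 sentences with replacement from the 258-sentence test set.
        resample = random.choices(test_sentences, k=len(test_sentences))
        scores.append(score_fn(resample))
    scores.sort()
    lower = scores[int(n_draws * (alpha / 2))]
    upper = scores[int(n_draws * (1 - alpha / 2))]
    return statistics.median(scores), (lower, upper)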

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely-Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and VPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html

The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi

Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm

The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html

Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16(2), 79-85.

Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19(2), 263-311.

Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.

Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.

Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).

Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28(1), 1-13.

Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).

Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.

Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.

Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).

Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.

Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.

Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).

Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).

Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.

Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3 Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resulting resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more machine translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon and improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target-language model that together results in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the language pairs contrasted in this paper (Hindi, Hebrew, Mapudungun, and Quechua), the AVENUE team has also worked on Dutch, and is beginning work on Inupiaq, Aymara, Brazilian Portuguese, and other languages native to Brazil.

4 Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullán Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People who participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121631.


The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631 We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer

5 References

Allen, James. (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramática Quechua. Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan, CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


'I saw him/her/them/it'

2.2.2.4.2 Tense

Tense in Mapudungun is often morphologically unmarked. The temporal interpretation of a verb is determined compositionally by the lexical aspect of the verb and the grammatical features of the verbal suffixes. Figure 7 lists the basic rules for tense in Mapudungun. Frequently, tense is marked in Spanish where it is unmarked in Mapudungun. Consequently, tense must be teased out of the dispersed clues Mapudungun provides. Since tense is determined taking into account information from both the verb (V) and the VSuffG, tense assignment is managed by rules, called VBar rules in the grammar, that combine V and VSuffG constituents. For instance, Figure 8 displays a simplified version of the rule that assigns the past tense feature when necessary (transfer of features from Mapudungun to Spanish is not represented in the rule, for brevity). Analogous rules deal with the other temporal specifications.

2.2.2.4.3 Typological divergence

As a third example of overcoming a translation challenge in our manually written Mapudungun-Spanish MT system, we take the transfer of Mapudungun morphological constructions that require an auxiliary or light verb on the Spanish side. A consequence of the heavy agglutination in Mapudungun is that many grammatical constructions that are expressed by morphological means in Mapudungun are grammaticalized using syntax in Spanish. For instance, many nouns and adjectives in Mapudungun can be converted to verbs through (possibly zero) suffixation. But the meanings of these Mapudungun denominal and deadjectival verbs are often expressed with a 'light verb + noun' syntactic construction in Spanish; see (2). Similarly, passive voice in Mapudungun is marked by the verbal suffix -nge, while passive voice in Spanish, as well as in English, requires an auxiliary verb, which carries tense and agreement features, and a passive participle; see (3).

(2) Mapudungun: kofke-tu-n
        bread-verbalizer-1sg.subj.indic
    Spanish: com-o pan
        eat-1sg.pres.indic bread
    English: 'I eat bread'

(3) Mapudungun: pe-nge-n
        see-passive-1sg.subj.indic
    Spanish: fui visto
        Aux.1sg.past.indic see.passPart
    English: 'I was seen'

In a syntactic transfer system, both the Spanish light verb construction and the Spanish passive require the introduction of an additional verb on the Spanish side. The rule for the passive, a VBar rule in our manually written grammar, has to insert the auxiliary, assign the auxiliary the correct grammatical features, and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

Figure 7: Abstract rules for the interpretation of tense in Mapudungun.

Grammatical Features    Lexical    Temporal         Example
Tense      Aspect       Aspect     Interpretation
Future     Any          Any        Future           pe-a-n (see-future-1sg.subj.indic) 'I will see'
Unmarked   Any          Stative    Present          niye-n (own-1sg.subj.indic) 'I own'
Unmarked   Habitual     Unmarked   Present          kellu-ke-n (help-habitual-1sg.subj.indic) 'I help'
Unmarked   Unmarked     Unmarked   Past             kellu-n (help-1sg.subj.indic) 'I helped'

{VBar,1}
VBar::VBar : [V VSuffG] -> [V]
(
  (X1::Y1)                            ; alignment
  ((X2 tense) = UNDEFINED)            ; SL constraint on morphological tense
  ((X1 lexicalaspect) = UNDEFINED)    ; SL constraint on the verb's aspectual class
  ((X2 aspect) = (NOT habitual))      ; SL constraint on grammatical aspect
  ((X0 tense) = past)                 ; SL tense feature assignment
  ...
)

Figure 8: Past tense rule (transfer of features omitted).
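The mapping in Figure 7 is, in effect, a small decision table. A minimal sketch of the same logic (illustrative Python; the feature names are our own simplifications, not the grammar's actual feature symbols, and the real system encodes these checks as unification constraints in VBar rules):

def temporal_interpretation(tense, aspect, lexical_aspect):
    # Row 1 of Figure 7: morphological future always wins.
    if tense == "future":
        return "future"
    # Tense is morphologically unmarked from here on.
    if lexical_aspect == "stative":
        return "present"
    if aspect == "habitual":
        return "present"
    # Unmarked tense, unmarked aspect, non-stative verb.
    return "past"

print(temporal_interpretation("future", None, None))    # pe-a-n: future
print(temporal_interpretation(None, None, "stative"))   # niye-n: present
print(temporal_interpretation(None, "habitual", None))  # kellu-ke-n: present
print(temporal_interpretation(None, None, None))        # kellu-n: past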



2.2.2.5 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon needn't contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona under a research license (see http://www.lsi.upc.es/~nlp/freeling/parole-es.html). Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that inflected form marks. We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute and value pairs of the format that our MT system expects. This way, as the run-time transfer engine needs a particular inflected form of a Spanish word, the transfer engine can pass the uninflected form to the Spanish Morphology Generator and have it generate the appropriate surface inflected forms.
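Conceptually, the generation step is then a lookup keyed on a stem plus a set of grammatical feature-values. A minimal sketch (illustrative Python; the entries shown are assumptions for illustration only, while the real module is driven by the full UPC dictionary):

# Inflected-dictionary entries compiled into a lookup table.
SPANISH_FORMS = {
    ("estar", ("1sg", "pres", "ind")): "estoy",
    ("estar", ("1sg", "past", "ind")): "estuve",
    ("ayudar", ("part", "pass", "fem")): "ayudada",
}

def generate(stem, features):
    # Return the inflected surface form; fall back to the bare stem.
    return SPANISH_FORMS.get((stem, tuple(features)), stem)

print(generate("estar", ["1sg", "past", "ind"]))  # -> estuve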

When the Spanish morphological generation is integrated with the run-time transfer system, the final rule-based Mapudungun-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN))))))>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA))))))))>

tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO))))))))>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA)) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN))))))>

2.2.3 Evaluation

[Evaluation results to be completed: a hand evaluation; BLEU, NIST, and METEOR scores; whether the EBMT system can be evaluated on the same test data; statistical significance tests of the results; and automatic evaluation of a word-to-word translator.]

2.3 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals for building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]    ; insertion of aux on the Spanish side
(
  (X1::Y2)                          ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
  ((X2 voice) =c passive)           ; Mapudungun voice constraint
  ((Y1 person) = (Y0 person))       ; passing person features to aux in Spanish
  ((Y1 number) = (Y0 number))       ; passing number features to aux in Spanish
  ((Y1 mood) = (Y0 mood))           ; passing mood features to aux in Spanish
  ((Y2 number) =c (Y1 number))      ; Spanish-side agreement constraint
  ((Y1 tense) = past)               ; assigning tense feature to aux in Spanish
  ((Y1 form) =c ser)                ; Spanish auxiliary selection
  ((Y2 mood) = part)                ; Spanish-side verb form constraint
  ...
)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, the non-Indo-European language Mapudungun marks passive voice with a verbal suffix. Hence, an auxiliary verb must be inserted on the Spanish side: the auxiliary verb is the first V constituent on the production side of the transfer rule. The rule then ensures that the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule, for brevity.



For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker, and were later manually revised. Taking the Elicitation Corpus (Section 2.3.1) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough to use for development of MT. The EC resembles a fieldwork questionnaire, containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features, such as number, person, tense, and gender. The version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books which had parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section 2.2.2.5).
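Schematically, the prototype chains three modules. The sketch below (illustrative Python; the three callables are placeholders for the actual analyzer, transfer engine, and Spanish generator, which are not implemented this way) shows the data flow:

def translate(sentence, analyze, transfer, generate):
    # 1. Morphological analysis: split each word into root and suffixes.
    segmented = [analyze(word) for word in sentence.split()]
    # 2. Transfer: apply the lexicon and the grammar rules.
    stems_and_features = transfer(segmented)
    # 3. Generation: inflect each Spanish stem with its features.
    return " ".join(generate(stem, feats)
                    for stem, feats in stems_and_features)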

2.3.2.1 Morphology and Translation Lexicons

In order to build a translation and morphology lexicon, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books. Only 3,002 word types were in more than one book.1 Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

1 This was done before the OCR correction was completed, and thus this list contained OCR errors.

The 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10.

1. Word Segmentation
2. Root Translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the Final Root

Figure 10: Fields required to automatically build translation and morphology lexicons.

The native speaker was asked to provide the Translation of the Final Root (the last field in Figure 10) only if there had been a part-of-speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua, this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old" and the word is a verb whose root really means "to get old" ("machuyay").2 Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently in Spanish. For these cases, the native speaker was asked to use '||' instead of '|', and the post-processing scripts were designed to check for the consistency of '||' in both the translation and the POS fields.
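A sketch of this expansion step (illustrative Python; the expand function is a hypothetical simplification of the actual post-processing scripts) shows how the '|' and '||' conventions yield the six entries for the root of "chayqa":

def expand(word, translations, pos_tags):
    # '||' separates translation/POS groups that belong together
    # (a POS change); '|' separates alternatives within a group.
    entries = []
    for trans_group, pos_group in zip(translations.split("||"),
                                      pos_tags.split("||")):
        for pos in pos_group.split("|"):
            for trans in trans_group.split("|"):
                entries.append((pos.strip(), word, trans.strip()))
    return entries

# The root 'chay' as Pron | Adj with 'ese | esa | eso' -> six entries.
for entry in expand("chay", "ese | esa | eso", "Pron | Adj"):
    print(entry)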

2 -ya- is a verbalizer in Quechua.


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word: chayqa
Segmentation: chay+qa
Root translation: ese | esa | eso
Root POS: Pron | Adj
Word translation: ese || es ese
Word POS: Pron || VP

Figure 11: Example of a word segmented and translated.

V | [ni] -> [decir] ((X1::Y1))
N | [pacha] -> [tiempo] ((X1::Y1))
N | [pacha] -> [tierra] ((X1::Y1))
Pron | [noqa] -> [yo] ((X1::Y1))
Interj | [alli] -> [a pesar] ((X1::Y1))
Adj | [hatun] -> [grande] ((X1::Y1))
Adj | [hatun] -> [alto] ((X1::Y1))
Adv | [kunan] -> [ahora] ((X1::Y1))
Adv | [allin] -> [bien] ((X1::Y1))
Adv | [ama] -> [no] ((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list.

; 'dicen que' on the Spanish side
Suff::Suff | [s] -> [] ((X1::Y1) ((x0 type) = reportative))

; when following a consonant
Suff::Suff | [si] -> [] ((X1::Y1) ((x0 type) = reportative))

Suff::Suff | [qa] -> [] ((X1::Y1) ((x0 type) = emph))

Suff::Suff | [chu] -> [] ((X1::Y1) ((x0 type) = interr))

VSuff::VSuff | [nki] -> [] ((X1::Y1) ((x0 person) = 2) ((x0 number) = sg) ((x0 mood) = ind) ((x0 tense) = pres) ((x0 inflected) = +))

NSuff::NSuff | [kuna] -> [] ((X1::Y1) ((x0 number) = pl))

NSuff::Prep | [manta] -> [de] ((X1::Y1) ((x0 form) = manta))

Figure 13: Manually written suffix lexical entries.

For the current working MT prototype, the suffix lexicon has 36 entries; Cusihuaman's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in subsection 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the morphological generator discussed in Section 2.2.2.5.

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated, and corrected when necessary, nine machine translations. The correction of MT output is the first step towards automatic rule refinement, and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.


{S,2}
S::S : [NP VP] -> [NP VP]
(
  (X1::Y1)
  (X2::Y2)
  ((x0 type) = (x2 type))
  ((y1 number) = (x1 number))
  ((y1 person) = (x1 person))
  ((y1 case) = nom)
  ; subj-v agreement
  ((y2 number) = (y1 number))
  ((y2 person) = (y1 person))
  ; subj-embedded Adj agreement
  ((y2 PredAdj number) = (y1 number))
  ((y2 PredAdj gender) = (y1 gender))
)

{SBar,1}
SBar::SBar : [S] -> ["Dice que" S]
(
  (X1::Y2)
  ((x1 type) =c reportative)
)

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
(
  (X1::Y1)
  ((x0 person) = (x3 person))
  ((x0 number) = (x3 number))
  ((x2 mood) = (NOT ger))
  ((x3 inflected) =c +)
  ((x0 inflected) = +)
  ((x0 tense) = (x2 tense))
  ((y1 tense) = (x2 tense))
  ((y1 person) = (x3 person))
  ((y1 number) = (x3 number))
  ((y1 mood) = (x3 mood))
)

Figure 14: Manually written grammar rules for Quechua-Spanish translation.

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO)))))>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ)))))))>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO)))) (VP3 (VBAR2 (V35 SOY)) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA)))))))>


Figure 15: MT output from the Quechua-Spanish rule-based MT prototype.

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants, and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.
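Root-and-pattern formation can be illustrated with a short sketch (illustrative Python; the root and the patterns shown are standard textbook examples, not entries from our system):

def apply_pattern(root, pattern):
    # Fill each 'C' slot in the pattern with the next root consonant.
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch
                   for ch in pattern)

print(apply_pattern("ktb", "CaCaC"))   # -> 'katab' (a verb-like form)
print(apply_pattern("ktb", "miCCaC"))  # -> 'miktab' (a noun-like form)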

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous at both the lexical and morphological levels. The Hebrew script,3 not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"); prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"); subordinating conjunctions such as $ ("that") and K$ ("when"); relativizers such as $ ("that"); and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus, a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner".

3 To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.

Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or a prefixed particle.

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus, the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule based, relying on the AVENUE run-time transfer engine described earlier in this paper to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor-effort equivalent of about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. The small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing publicly available resources, which we adapted in novel ways for the MT task, to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.



2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next subsection) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
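The pre-processing step is essentially a character-level mapping. A minimal sketch (illustrative Python; only a fragment of the mapping table is shown, and the particular letter-to-ASCII pairings are assumptions made for illustration, not our actual transliteration table):

ROMANIZATION = {
    "\u05d0": "A",  # alef
    "\u05d1": "B",  # bet
    "\u05d4": "H",  # he
    "\u05d5": "W",  # vav
    "\u05e8": "R",  # resh
    "\u05e9": "$",  # shin
}

def romanize(text):
    # Map each Hebrew letter to its ASCII symbol; pass through
    # punctuation and anything not in the (fragmentary) table.
    return "".join(ROMANIZATION.get(ch, ch) for ch in text)

print(romanize("\u05d1\u05e9\u05d5\u05e8\u05d4"))  # -> B$WRH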

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the Hebrew analyzer is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which will all be stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice representation of the set of analyses for the Hebrew word B$WRH.

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-structure representation of the set of analyses for the Hebrew word B$WRH.
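In data-structure terms, each analysis arc spans character positions and carries a feature structure, and any arc path covering the whole word is one reading. A minimal sketch (illustrative Python, loosely mirroring Figures 2 and 3; the field names are simplified, and the unknown-word arc is omitted for brevity):

lattice = [
    {"span": (0, 4), "lex": "B$WRH", "pos": "N"},
    {"span": (0, 2), "lex": "B",     "pos": "PREP"},
    {"span": (1, 3), "lex": "$WR",   "pos": "N"},
    {"span": (3, 4), "lex": "$LH",   "pos": "POSS"},
    {"span": (0, 1), "lex": "B",     "pos": "PREP"},
    {"span": (1, 2), "lex": "H",     "pos": "DET"},
    {"span": (2, 4), "lex": "$WRH",  "pos": "N"},
]

def paths(start=0, end=4):
    # Enumerate every arc sequence that covers positions start..end;
    # each complete path is one reading of the word.
    if start == end:
        return [[]]
    result = []
    for arc in lattice:
        if arc["span"][0] == start:
            for rest in paths(arc["span"][1], end):
                result.append([arc["lex"]] + rest)
    return result

print(paths())
# [['B$WRH'], ['B', '$WRH'], ['B', '$WR', '$LH'], ['B', 'H', '$WRH']]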

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice, rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.



2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) which require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

Unlike the Spanish side of our Mapudungun and Quechua systems, the Hebrew-to-English transfer system currently does not use a morphological generator on the target (English) side. Instead, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
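A minimal sketch of this enhancement process (illustrative Python; the naive regular inflection shown here is an assumption for illustration and ignores irregular English forms, and "XYZ" is a placeholder romanized Hebrew word):

def expand_noun(hebrew, english):
    # One entry per English inflected form, each carrying the
    # constraint that licenses it at unification time.
    return [
        (hebrew, english,       {"num": "sg"}),
        (hebrew, english + "s", {"num": "pl"}),  # naive pluralization
    ]

def expand_verb(hebrew, english):
    return [
        (hebrew, english,           {"form": "infinitive"}),
        (hebrew, english + "ed",    {"tense": "past"}),  # naive
        (hebrew, english + "s",     {"tense": "pres", "person": "3sg"}),
        (hebrew, "will " + english, {"tense": "future"}),
    ]

print(expand_noun("XYZ", "book"))
# [('XYZ', 'book', {'num': 'sg'}), ('XYZ', 'books', {'num': 'pl'})]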

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
;; SL: $MLH ADWMH
;; TL: A RED DRESS
;; Score: 2
NP1::NP1 : [NP1 ADJ] -> [ADJ NP1]
(
  (X2::Y1)
  (X1::Y2)
  ((X1 def) = -)
  ((X1 status) =c absolute)
  ((X1 num) = (X2 num))
  ((X1 gen) = (X2 gen))
  (X0 = X1)
)

{NP1,3}
;; SL: H $MLWT H ADWMWT
;; TL: THE RED DRESSES
;; Score: 4
NP1::NP1 : [NP1 "H" ADJ] -> [ADJ NP1]
(
  (X3::Y1)
  (X1::Y2)
  ((X1 def) = +)
  ((X1 status) =c absolute)
  ((X1 num) = (X3 num))
  ((X1 gen) = (X3 gen))
  (X0 = X1)
)

Figure 4: NP transfer rules for nouns modified by adjectives, from Hebrew to English.

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team, and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages, and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Figure 5: Selected Translated Sentences from the Development Data (system output shown verbatim)


System           | BLEU                    | NIST                    | Precision | Recall
No Grammar       | 0.0606 [0.0599,0.0612]  | 3.4176 [3.4080,3.4272]  | 0.3830    | 0.4153
Learned Grammar  | 0.0775 [0.0768,0.0782]  | 3.5397 [3.5296,3.5498]  | 0.3938    | 0.4219
Manual Grammar   | 0.1013 [0.1004,0.1021]  | 3.7850 [3.7733,3.7966]  | 0.4085    | 0.4241

Table 1: System Performance Results with the Various Grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period, and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few selected translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al. 2002) and NIST (Doddington 2002) automatic metrics for MT evaluation. We also include aggregate unigram precision and unigram recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we applied a commonly used bootstrapping technique (Efron and Tibshirani 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual grammar system outperforms the no-grammar system according to all the metrics. The learned grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All reported differences are statistically highly significant.
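The aggregate unigram precision and recall reported in Table 1 can be computed along the following lines. This is a hedged sketch: matching each hypothesis against its best-matching reference with clipped counts is our assumption about the exact formulation, which the paper does not spell out.

from collections import Counter

def unigram_precision_recall(hypotheses, reference_sets):
    # Aggregate (corpus-level) unigram precision/recall. For each sentence,
    # hypothesis tokens are matched with clipping against each reference,
    # and the best-matching reference is used.
    matched = hyp_total = ref_total = 0
    for hyp, refs in zip(hypotheses, reference_sets):
        hyp_counts = Counter(hyp.split())
        best = max(refs, key=lambda r: sum((hyp_counts & Counter(r.split())).values()))
        ref_counts = Counter(best.split())
        matched += sum((hyp_counts & ref_counts).values())
        hyp_total += sum(hyp_counts.values())
        ref_total += sum(ref_counts.values())
    return matched / hyp_total, matched / ref_total

# toy usage with two references per sentence, as in our test set
p, r = unigram_precision_recall(["a red dress"],
                                [["the red dress", "a red gown"]])
print(p, r)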

In summary, we have described the rapid development of a preliminary Hebrew-to-English machine translation system under a transfer-based framework specifically designed for rapid MT prototyping for languages with limited linguistic resources. The task is particularly challenging for two main reasons: the high lexical and morphological ambiguity of Hebrew, and the dearth of available resources for the language. Existing publicly available resources were adapted in novel ways to support the MT task. The methodology behind the system combines two separate modules: a transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to an English language model. We have demonstrated that a small, manually crafted set of transfer rules suffices to produce legible translations, and performance results, evaluated using state-of-the-art measures, are encouraging.

2.11 Hindi

As of 2003, little natural language processing work had been done with Hindi, which led DARPA to designate Hindi as the language for which to build MT systems in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-Based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than with the current automatically inferred rules. Furthermore, a multi-engine version of our system that combines the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney 2002] and Example-Based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,


and have significantly improved the state of the art of machine translation for a number of different language pairs. These approaches are attractive because they are fully automated, and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or in short the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003.

As data resources became available, it became clear that the Hindi-to-English task was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE (cf. other approaches to MT for low-density languages [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]). We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than with the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data

scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1. Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main subsystems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four subsystems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation subsystem is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules, and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows the SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP [V V Aux] -> [Aux being V]
(
 (X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part)
)

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat", in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as Det N P and Det Adj N P can be rewritten more generally as NP P, as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only

for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules, by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
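The compositionality step can be sketched schematically as below, under a deliberately simplified rule representation (a right-hand side as a list of category symbols): a flat sequence produced by Seed Generation is generalized by substituting a previously learned constituent rule. This is an illustration of the idea, not the system's actual implementation.

def compose(flat_rhs, constituent_rhs, constituent_label):
    # Replace occurrences of constituent_rhs inside flat_rhs with the
    # constituent's label, yielding a more general, hierarchical rule.
    n = len(constituent_rhs)
    out, i = [], 0
    while i < len(flat_rhs):
        if flat_rhs[i:i + n] == constituent_rhs:
            out.append(constituent_label)
            i += n
        else:
            out.append(flat_rhs[i])
            i += 1
    return out

# Two flat Hindi PP rules from Seed Generation collapse to the same
# hierarchical rule NP P once the learned NP rules are substituted:
print(compose(["Det", "Adj", "N", "P"], ["Det", "Adj", "N"], "NP"))  # ['NP', 'P']
print(compose(["Det", "N", "P"], ["Det", "N"], "NP"))                # ['NP', 'P']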

2.3 The Run-Time Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The run-time translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the run-time transfer-based translation subsystem can be found in [Peterson 2002].
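The following toy sketch illustrates this parse-transfer-generate pipeline on the noun-adjective reordering of the Hebrew example in Figure 4. The rule and lexicon representations are heavily simplified assumptions (binary rules only, no unification constraints), intended only to show how the chart doubles as the translation lattice.

# rule: (type, x_side, y_side, alignment), alignment maps x index -> y index
RULES = [
    ("NP", ["NP", "ADJ"], ["ADJ", "NP"], {0: 1, 1: 0}),  # noun-adj order flips
]
LEXICON = {"$MLH": [("NP", "dress")], "ADWMH": [("ADJ", "red")]}

def translate(tokens):
    # chart[(i, j)] = list of (category, tl_string) covering tokens[i:j]
    chart = {}
    for i, tok in enumerate(tokens):              # seed with lexical transfer
        chart[(i, i + 1)] = list(LEXICON.get(tok, []))
    for width in range(2, len(tokens) + 1):       # bottom-up over spans
        for i in range(len(tokens) - width + 1):
            j, entries = i + width, []
            for cat, xs, ys, align in RULES:
                for k in range(i + 1, j):         # split point for binary rules
                    for c1, t1 in chart.get((i, k), []):
                        for c2, t2 in chart.get((k, j), []):
                            if [c1, c2] == xs:
                                parts = [t1, t2]  # reorder per alignment
                                tl = " ".join(parts[x] for x in
                                              sorted(align, key=align.get))
                                entries.append((cat, tl))
            chart[(i, j)] = entries
    return chart    # the full chart is the lattice handed to the decoder

print(translate(["$MLH", "ADWMH"]))   # span (0, 2) yields ('NP', 'red dress')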

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining, but non-overlapping, translation units that maximizes the probability $p(e|f)$, where $f = f_1 \ldots f_J$ is the source sentence and $e = e_1 \ldots e_I$ is the sequence of target language words. According to Bayes' decision rule, we have to search for

\hat{e} = \arg\max_e p(e|f) = \arg\max_e p(f|e)\, p(e)    (1)

The language model $p(e)$ describes how well-formed a target language sentence $e$ is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability $p(f|e)$ for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase, and the sum over all words in the target phrase. For any possible sequence $s$ of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then

p(f \mid e) = \prod_{s} p(\tilde{f}_s \mid \tilde{e}_s) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i})    (4)

The search algorithm considers all possible sequences $s$ in the lattice, and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of reordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word reorderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses that score low compared to the best scoring hypothesis up to that point are pruned from further consideration.
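A minimal monotone decoder sketch is given below. It enumerates sequences of adjoining, non-overlapping translation units covering the input, scoring them in log form per Equation (1). The reordering and beam pruning described above are omitted, the language model is a stand-in for the trigram model of Equation (2), and all probabilities are toy values.

def decode(lattice, n):
    # lattice: {(start, end): [(english_phrase, log_tm_score), ...]}
    # n: number of source words; returns (best_logprob, best_words)
    def lm_logprob(words):
        return -0.5 * len(words)       # stand-in for the trigram LM
    best = (float("-inf"), None)
    def extend(pos, words, tm):
        nonlocal best
        if pos == n:                   # full coverage of the source reached
            score = tm + lm_logprob(words)
            if score > best[0]:
                best = (score, words)
            return
        for (s, e), units in lattice.items():
            if s == pos:               # adjoining, non-overlapping units only
                for phrase, logp in units:
                    extend(e, words + phrase.split(), tm + logp)
    extend(0, [], 0.0)
    return best

# toy lattice for a 3-word input, with overlapping units of various sizes
lattice = {(0, 1): [("letters", -0.2)],
           (1, 3): [("are being sent", -0.4), ("sent going", -2.0)],
           (0, 3): [("letters are sent", -1.5)]}
print(decode(lattice, 3))   # selects 'letters are being sent'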

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a

compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they could not translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our

trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural, even for native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the run-time translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translations.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description         | Morpher        | Xfer
past participle     | (tam = yA)     | (aspect = perf) (form = part)
present participle  | (tam = wA)     | (aspect = imperf) (form = part)
infinitive          | (tam = nA)     | (form = inf)
future              | (tam = future) | (tense = fut)
subjunctive         | (tam = subj)   | (tense = subj)
root                | (tam = 0)      | (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the run-time Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left-hand side of the figure shows the raw output of the morphology system. The right-hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
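The Table I feature mapping amounts to a small lookup, sketched below. Representing Morpher output as a Python dict is our simplification (the real module emits its own format, and the UTF-8/Roman-WX conversion is not shown); the mapping values themselves come directly from Table I.

TAM_MAP = {                       # tense/aspect/mood mapping from Table I
    "yA":     {"aspect": "perf",   "form": "part"},   # past participle
    "wA":     {"aspect": "imperf", "form": "part"},   # present participle
    "nA":     {"form": "inf"},                        # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def to_xfer_features(morpher_output):
    # Map raw Morpher features into the Xfer grammar's feature system.
    feats = {}
    for key, value in morpher_output.items():
        if key == "tam":
            feats.update(TAM_MAP.get(value, {}))
        else:
            feats[key] = value        # gender, number, etc. pass through
    return feats

# e.g., a hypothetical Morpher analysis of an inflected form of 'raha'
print(to_xfer_features({"root": "rah", "tam": "wA", "num": "sg"}))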

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences covered by the rules span the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

NP1,2: NP::NP [PP NP1] -> [NP1 PP]
       ((X1::Y2) (X2::Y1))
NP1,3: NP::NP [NP1] -> [NP1]
       ((X1::Y1))
PP1,2: PP::PP [NP Postp] -> [Prep NP]
       ((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion:
(jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion:
(one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
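A sketch of the frequency-based scoring and pruning just described follows. The relative-frequency probability model and the 0.05 threshold are illustrative assumptions, not the system's actual values.

from collections import Counter

def prune_rules(rule_occurrences, threshold=0.05):
    # rule_occurrences: one (constituent_type, rule) entry per training
    # example that produced the rule. Keep rules whose relative frequency
    # within their constituent type clears the threshold.
    counts = Counter(rule_occurrences)
    type_totals = Counter(ctype for ctype, _ in rule_occurrences)
    return {(ctype, rule): c / type_totals[ctype]
            for (ctype, rule), c in counts.items()
            if c / type_totals[ctype] >= threshold}

# a rule produced by 40 examples survives; one produced by 2 does not
occurrences = ([("NP", "ADJ N -> ADJ N")] * 40 +
               [("NP", "N CONJ N -> N CONJ N")] * 2)
print(prune_rules(occurrences))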

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense forms. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.
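This run-time interaction can be sketched as follows, with simplified, hypothetical data structures; the root patr is taken from the "letters" example in Section 2, and the unification here is the minimal "no contradicting feature" check.

def unify(features, constraints):
    # Succeeds iff no constraint contradicts the analyzed features
    # (features absent from the analysis unify freely).
    return all(features.get(k, v) == v for k, v in constraints.items())

LEXICON = {  # Hindi root -> candidate lexical transfer rules
    "patr": [("letter",  {"num": "sg"}),
             ("letters", {"num": "pl"})],   # entry added by the enhancement
}

def lookup(root, morph_features):
    return [eng for eng, cons in LEXICON.get(root, [])
            if unify(morph_features, cons)]

# a plural Hindi noun selects only the plural English entry
print(lookup("patr", {"num": "pl"}))   # ['letters']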

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for the plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of postposition words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the words of the Hindi sentence against the lexicon in their full (inflected) forms, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
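Schematically, the three passes can be outlined as below. The lexicons, the Morpher stub, and the grammar stub are placeholders rather than the actual components; only the control flow mirrors the description above.

PHRASAL_LEXICON = {("bheje", "jAte", "h~ai"): ["are being sent"]}  # full forms
WORD_LEXICON = {"patr": ["letter"], "bheje": ["sent"]}

def morpher(word):                   # stub for the IIIT Morpher
    return {"root": word, "feats": {}}

def apply_grammar(analyses):         # stub for grammar-rule application
    return []

def produce_lattice(sentence):
    lattice = []
    # Pass 1: match full inflected forms, so phrasal entries can fire;
    # no grammar rules are applied in this pass.
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            for tl in PHRASAL_LEXICON.get(tuple(sentence[i:j]), []):
                lattice.append(((i, j), tl))
    # Pass 2: morphological analysis feeds root forms (plus features)
    # into the grammar rules, using the enhanced lexicon of Section 4.2.
    lattice.extend(apply_grammar([morpher(w) for w in sentence]))
    # Pass 3: original inflected words matched word-to-word, no grammar.
    for i, word in enumerate(sentence):
        for tl in WORD_LEXICON.get(word, []):
            lattice.append(((i, i + 1), tl))
    return lattice   # a bigger lattice is preferable; the decoder rescores it

print(produce_lattice(["patr", "bheje", "jAte", "h~ai"]))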

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group, as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System               | BLEU            | NIST
EBMT                 | 0.058           | 4.22
SMT                  | 0.102 (+-0.016) | 4.70 (+-0.20)
Xfer no grammar      | 0.109 (+-0.015) | 5.29 (+-0.19)
Xfer learned grammar | 0.112 (+-0.016) | 5.32 (+-0.19)
Xfer manual grammar  | 0.135 (+-0.018) | 5.59 (+-0.20)
MEMT (Xfer+SMT)      | 0.136 (+-0.018) | 5.65 (+-0.21)

Table II. System Performance Results for the Various Translation Approaches

Fig. 9. Results by NIST score, as a function of the decoder reordering window (0-4), for MEMT (SMT+Xfer), Xfer with manual grammar, Xfer with learned grammar, Xfer with no grammar, and SMT.

Fig. 10. Results by BLEU score, as a function of the decoder reordering window (0-4), for the same five systems.

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-Based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the resulting set of scores.
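A sketch of this bootstrap procedure follows; score_fn stands in for a corpus-level metric such as BLEU or NIST, and the toy corpus score used in the usage example (a simple mean of per-sentence values) is an illustrative simplification.

import random

def bootstrap_ci(sentences, score_fn, iterations=10000, seed=0):
    # Resample the test set with replacement, rescore each replicate,
    # and report the median and a 95% confidence interval.
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        sample = [rng.choice(sentences) for _ in sentences]
        scores.append(score_fn(sample))
    scores.sort()
    return (scores[len(scores) // 2],           # median
            scores[int(0.025 * len(scores))],   # lower 95% bound
            scores[int(0.975 * len(scores))])   # upper 95% bound

# toy usage: per-sentence quality values; real BLEU/NIST are computed
# over the whole replicate, not averaged per sentence
data = [0.10, 0.12, 0.09, 0.15, 0.11] * 52
print(bootstrap_ci(data, lambda s: sum(s) / len(s)))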

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point": given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely-Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded, transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated, rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/~hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to machine translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for resource-scarce languages has proven effective. By first focusing effort on producing basic NLP resources, the resulting resources are of sufficient quality to be put to a number of uses, from building NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon and improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Hebrew-to-English

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target-language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

In addition to Mapudungun, Quechua, Hebrew, and Hindi, the AVENUE team has worked on Dutch, and is beginning work on Inupiaq, Aymara, Brazilian Portuguese, and indigenous languages of Brazil.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullán Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People who participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121631.

The research on Hebrew reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio (2001). Gramática Quechua Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pages 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef and Carolina Huenchullán (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD Thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 Workshop at MT Summit VIII.

References (Hebrew-to-English system)

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


the correct grammatical features and inflect the verb as a passive participle. Figure 9 shows a simplified version of the rule that produces this result.

2.2.2.5 Spanish Morphology Generation

Among the three languages involved in our MT systems for indigenous South American languages, it is not either of the two indigenous languages, Mapudungun or Quechua, but rather Spanish that has the simplest morphology. Even so, modularizing a Spanish morphological component greatly simplifies the translation lexicon. With a Spanish morphological generation system that can inflect Spanish words according to the relevant grammatical features, the translation lexicon need not contain every inflected form of each Spanish word, but rather just Spanish stems.

In order to build a Spanish morphological generation module, we obtained a morphologically inflected dictionary from the Universitat Politècnica de Catalunya (UPC) in Barcelona under a research license (http://www.lsi.upc.es/~nlp/freeling/parole-es.html). Under each Spanish citation form in this dictionary, each inflected form of that Spanish lexeme is listed, together with a specification of the grammatical feature-values that the inflected form marks. We then mapped the grammatical feature-values contained in the inflected dictionary into feature attribute and value pairs of the format that our MT system expects. This way, as the run-time transfer engine needs a particular inflected form of a Spanish word, the transfer engine can pass the uninflected form to the Spanish Morphology Generator and have it generate the appropriate surface inflected forms.
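As an illustration of this mapping step, the sketch below converts one dictionary entry into attribute-value pairs and looks up a surface form at generation time. The positional tag scheme, field names, and table contents are hypothetical stand-ins, not the actual UPC dictionary format:

# Hypothetical positional code for verb tags: mood, tense, person, number.
VERB_FIELDS = [
    ("mood", {"I": "ind", "S": "subj", "P": "part", "G": "ger"}),
    ("tense", {"P": "pres", "S": "past", "F": "fut"}),
    ("person", {"1": "1", "2": "2", "3": "3"}),
    ("number", {"S": "sg", "P": "pl"}),
]

def tag_to_features(tag):
    # 'IS1S' -> [("mood","ind"), ("tense","past"), ("person","1"), ("number","sg")]
    feats = []
    for (name, values), code in zip(VERB_FIELDS, tag):
        if code in values:
            feats.append((name, values[code]))
    return feats

# Generation table: (stem + features) -> surface form, built by inverting
# the inflected dictionary. "estuve" is a real form from our output below.
generation_table = {
    ("estar",) + tuple(tag_to_features("IS1S")): "estuve",
}

def generate(stem, feats):
    # The transfer engine passes a stem plus features; we return the
    # inflected surface form (falling back to the stem if unknown).
    return generation_table.get((stem,) + tuple(feats), stem)

print(generate("estar", tag_to_features("IS1S")))  # -> estuve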

When the Spanish morphological generation is integrated with the run-time transfer system, the final rule-based Quechua-Spanish MT system produces output such as the following:

sl: kümelen (I'm fine)
tl: ESTOY BIEN
tree: <((S5 (VPBAR2 (VP1 (VBAR9 (V10 ESTOY) (V15 BIEN) ) ) ) ) )>

sl: ayudangelay (he/she was not helped)
tl: NO FUE AYUDADA
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADA) ) ) ) ) ) )>
tl: NO FUE AYUDADO
tree: <((S5 (VPBAR2 (VP2 (NEGP1 (LITERAL NO) (VBAR3 (V11 FUE) (V8 AYUDADO) ) ) ) ) ) )>

sl: Kuan ñi ruka (John's house)
tl: LA CASA DE JUAN
tree: <((S12 (NP7 (DET10 LA) (NBAR2 (N8 CASA) ) (LITERAL DE) (NP4 (NBAR2 (N1 JUAN) ) ) ) ) )>

2.2.3 Evaluation

Things that still need to go in here: a hand evaluation; BLEU, NIST, and METEOR scores; whether we can evaluate the EBMT system on the same test data; statistical significance tests of the results; and an automatic evaluation of a word-to-word translator.

2.3 Quechua

For the Quechua-Spanish language pair, a small prototype RBMT system was manually developed.

Data collection for Quechua started in 2004, after the AVENUE team established a collaboration with bilingual speakers in Cusco (Peru). In 2005, one of the authors (Ariadna Font Llitjós) spent three months in Cusco to learn Quechua and to set up the necessary basic infrastructure for the development of a Quechua-Spanish MT prototype system. One of the main goals for building a manual RBMT system for this language pair was to test the effects of an Automatic Rule Refiner on a small RBMT system with a resource-scarce language. The AVENUE Automatic Rule Refiner improves both the grammar and the lexicon of RBMT systems, guided by post-editing

{VBar,6}
VBar::VBar : [V VSuffG] -> [V V]    ; insertion of aux on the Spanish side
(
(X1::Y2)                            ; Mapudungun verb (X1) aligned to Spanish verb (Y2)
((X2 voice) =c passive)             ; Mapudungun voice constraint
((Y1 person) = (Y0 person))         ; passing person features to aux in Spanish
((Y1 number) = (Y0 number))         ; passing number features to aux in Spanish
((Y1 mood) = (Y0 mood))             ; passing mood features to aux in Spanish
((Y2 number) =c (Y1 number))        ; Spanish-side agreement constraint
((Y1 tense) = past)                 ; assigning tense feature to aux in Spanish
((Y1 form) =c ser)                  ; Spanish auxiliary selection
((Y2 mood) = part)                  ; Spanish-side verb form constraint
...
)

Figure 9: Simplified passive voice transfer rule. Where Spanish uses a syntactic construction to produce passive voice, the non-Indo-European language Mapudungun marks passive voice with a verbal suffix. Hence, an auxiliary verb must be inserted on the Spanish side. The auxiliary verb is the first V constituent on the production side of the transfer rule. This rule then ensures that the person, number, mood, and tense features are marked on the auxiliary in Spanish, while the Spanish content verb appears as a passive participle. The transfer of features from Mapudungun to Spanish is not represented in this rule for brevity.


information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker, and were later manually revised. Taking the Elicitation Corpus (Section N) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough to use for development of MT. The EC resembles a fieldwork questionnaire, containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features, such as number, person, tense, and gender. The version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs, and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books which had parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba, and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, with the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine and, finally, the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section 2.2.2.5).

2.3.2.1 Morphology and Translation Lexicons

In order to build translation and morphology lexicons, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books; only 3,002 word types were in more than one book.1 Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.

1 This was done before the OCR correction was completed, and thus this list contained OCR errors.
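A minimal sketch of this frequency-based selection, with hypothetical file names standing in for the scanned and corrected book texts:

from collections import Counter

def count_types(paths):
    # Count word types across all corpus files.
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    return counts

counts = count_types(["cuentos_cusquenos.txt",
                      "cuentos_de_urubamba.txt",
                      "gregorio_condori_mamani.txt"])

top_types = [w for w, _ in counts.most_common(10000)]   # sent for annotation
singletons = sum(1 for c in counts.values() if c == 1)  # 16,722 in our data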

10,000 words were segmented and translated by a native Quechua speaker. For each word, the native Quechua speaker was asked to provide information for all the fields listed in Figure 10.

1. Word Segmentation
2. Root Translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the Final Root

Figure 10: Fields required to automatically build the translation and morphology lexicons.

The native speaker was asked to provide the Translation of the Final Root (last field in Figure 10) only if there had been a Part-of-Speech (POS) change. The reason we needed this is that, if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old", and the word is a verb whose root really means "to get old" ("machuyay").2 Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently in Spanish. For these cases, the native speaker was asked to use || instead of |, and the post-processing scripts were designed to check for the consistency of || in both the translation and the POS fields.

2 -ya- is a verbalizer in Quechua


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word: chayqa
Segmentation: chay+qa
Root translation: ese | esa | eso
Root POS: Pron | Adj
Word translation: ese || es ese
Word POS: Pron || VP

Figure 11: Example of a segmented and translated word.
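The expansion step itself is a simple cross-multiplication of the '|'-separated alternatives. The sketch below, under assumed field conventions (it ignores the '||' word-level pairing described above), regenerates the six entries for "chay" in the lexical-rule format of Figure 12:

def expand(root, translations, pos_tags):
    # Cross each POS alternative with each translation alternative,
    # emitting one lexical rule per combination.
    entries = []
    for pos in (p.strip() for p in pos_tags.split("|")):
        for tr in (t.strip() for t in translations.split("|")):
            entries.append("%s::%s |: [%s] -> [%s]\n((X1::Y1))" % (pos, pos, root, tr))
    return entries

for entry in expand("chay", "ese | esa | eso", "Pron | Adj"):
    print(entry + "\n")
# Six entries: Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso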

V::V |: [ni] -> [decir]
((X1::Y1))

N::N |: [pacha] -> [tiempo]
((X1::Y1))

N::N |: [pacha] -> [tierra]
((X1::Y1))

Pron::Pron |: [noqa] -> [yo]
((X1::Y1))

Interj::Interj |: [alli] -> [a pesar]
((X1::Y1))

Adj::Adj |: [hatun] -> [grande]
((X1::Y1))

Adj::Adj |: [hatun] -> [alto]
((X1::Y1))

Adv::Adv |: [kunan] -> [ahora]
((X1::Y1))

Adv::Adv |: [allin] -> [bien]
((X1::Y1))

Adv::Adv |: [ama] -> [no]
((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list.

Suff::Suff |: [s] -> []      ; "dicen que" on the Spanish side
((X1::Y1)
((x0 type) = reportative))

Suff::Suff |: [si] -> []     ; when following a consonant
((X1::Y1)
((x0 type) = reportative))

Suff::Suff |: [qa] -> []
((X1::Y1)
((x0 type) = emph))

Suff::Suff |: [chu] -> []
((X1::Y1)
((x0 type) = interr))

VSuff::VSuff |: [nki] -> []
((X1::Y1)
((x0 person) = 2)
((x0 number) = sg)
((x0 mood) = ind)
((x0 tense) = pres)
((x0 inflected) = +))

NSuff::NSuff |: [kuna] -> []
((X1::Y1)
((x0 number) = pl))

NSuff::Prep |: [manta] -> [de]
((X1::Y1)
((x0 form) = manta))

Figure 13: Manually written suffix lexical entries.

For the current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuamán's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in Section 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in Section 2.2.2.5.

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, where three Quechua speakers with good knowledge of Spanish evaluated and corrected nine machine translations when necessary. The correction of MT output is the first step towards automatic rule refinement, and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.


{S,2}
S::S : [NP VP] -> [NP VP]
(
(X1::Y1)
(X2::Y2)
((x0 type) = (x2 type))
((y1 number) = (x1 number))
((y1 person) = (x1 person))
((y1 case) = nom)
; subj-v agreement
((y2 number) = (y1 number))
((y2 person) = (y1 person))
; subj-embedded Adj agreement
((y2 PredAdj number) = (y1 number))
((y2 PredAdj gender) = (y1 gender))
)

{SBar,1}
SBar::SBar : [S] -> [Dice que S]
(
(X1::Y2)
((x1 type) =c reportative)
)

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
(
(X1::Y1)
((x0 person) = (x3 person))
((x0 number) = (x3 number))
((x2 mood) = (NOT ger))
((x3 inflected) =c +)
((x0 inflected) = +)
((x0 tense) = (x2 tense))
((y1 tense) = (x2 tense))
((y1 person) = (x3 person))
((y1 number) = (x3 number))
((y1 mood) = (x3 mood))
)

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO) ) ) ) )>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ) ) ) ) ) ) )>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO) ) ) ) (VP3 (VBAR2 (V35 SOY) ) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA) ) ) ) ) ) )>

Figure 14: Manually written grammar rules for Quechua-Spanish translation.

Figure 15: MT output from the Quechua-Spanish rule-based MT prototype.

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word-formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants, and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and the morphological levels. The Hebrew script,3 not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"); prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"); subordinating conjunctions such as $ ("that") and K$ ("when"); relativizers such as $ ("that"); and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus, a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner".

3 To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.

Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or a prefixing particle.
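This kind of prefix-particle ambiguity is easy to reproduce. The toy sketch below enumerates segmentations of a romanized form by peeling off known particles and checking the remainder against a stem list; the three-word stem list stands in for a real lexicon, and a real analyzer would of course also handle suffixes and feature assignment:

PREFIXES = ["B", "K$", "K", "L", "M", "W", "$", "H"]  # attached particles
STEMS = {"MHGR", "HGR", "GR"}  # toy lexicon: immigrant, Hagar, foreigner

def segmentations(form, prefix_seq=()):
    # Return every way to read `form` as (particles..., stem).
    results = []
    if form in STEMS:
        results.append(prefix_seq + (form,))
    for p in PREFIXES:
        if form.startswith(p) and len(form) > len(p):
            results += segmentations(form[len(p):], prefix_seq + (p,))
    return results

print(segmentations("MHGR"))
# [('MHGR',), ('M', 'HGR'), ('M', 'H', 'GR')]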

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus, the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule based, relying on the AVENUE run-time transfer engine described above to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor-effort equivalent of about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. The small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing publicly available resources, which we adapted in novel ways for the MT task to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source-language morphology analysis and target-language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation


segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
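A minimal sketch of this mapping step, assuming the Windows Hebrew code page (cp1255) as input. Only a partial letter table is shown, restricted to letters whose romanization appears in this paper; the full table, final-letter forms, and punctuation handling are omitted:

ROMAN = {
    "\u05d0": "A",  # alef
    "\u05d1": "B",  # bet
    "\u05d2": "G",  # gimel
    "\u05d3": "D",  # dalet
    "\u05d4": "H",  # he
    "\u05d5": "W",  # vav
    "\u05dc": "L",  # lamed
    "\u05de": "M",  # mem
    "\u05e8": "R",  # resh
    "\u05e9": "$",  # shin
    "\u05ea": "T",  # tav
}

def romanize(raw_bytes):
    # Windows Hebrew encoding -> Unicode -> romanized ASCII.
    text = raw_bytes.decode("cp1255")
    return "".join(ROMAN.get(ch, ch) for ch in text)

# bet, shin, vav, resh, he -> "B$WRH" (the example word used below)
print(romanize("\u05d1\u05e9\u05d5\u05e8\u05d4".encode("cp1255")))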

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is thus available, the Hebrew analyzer is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure, which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line"), prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull"), with the preposition B as a prefix and an attached pronominal possessive clitic H ("her") as a suffix. Such a form would yield four different sequences of lexeme tokens, which will all be stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice representation of a set of analyses for the Hebrew word B$WRH.

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-structure representation of a set of analyses for the Hebrew word B$WRH.
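To make the lattice structure concrete, the sketch below stores each analysis of Figure 3 as an arc (spanstart, spanend, lexeme, features) and enumerates the lexeme sequences; this is our own illustrative data structure, not the transfer engine's internal representation:

arcs = [
    (0, 4, "B$WRH", {"POS": "N", "GEN": "F", "NUM": "S"}),  # "gospel"
    (0, 2, "B", {"POS": "PREP"}),
    (1, 3, "$WR", {"POS": "N", "GEN": "M", "NUM": "S"}),    # "bull"
    (3, 4, "$LH", {"POS": "POSS"}),                          # "her"
    (0, 1, "B", {"POS": "PREP"}),
    (1, 2, "H", {"POS": "DET"}),
    (2, 4, "$WRH", {"POS": "N", "GEN": "F", "NUM": "S"}),   # "line"
    (0, 4, "B$WRH", {"POS": "LEX"}),                         # unknown-word arc
]

def paths(start, end):
    # Enumerate all lexeme sequences along arcs from start to end.
    if start == end:
        return [[]]
    found = []
    for s, e, lex, feats in arcs:
        if s == start:
            found += [[lex] + rest for rest in paths(e, end)]
    return found

print(paths(0, 4))
# [['B$WRH'], ['B', '$WRH'], ['B', '$WR', '$LH'], ['B', 'H', '$WRH'], ['B$WRH']]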

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice, rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person


feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in a machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.
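Inverting the English-Hebrew section is itself a one-pass operation. A sketch under an assumed dictionary format (each headword mapped to a list of translations):

def invert(en_to_he):
    # Turn an English->Hebrew section into Hebrew->English entries.
    he_to_en = {}
    for en, he_words in en_to_he.items():
        for he in he_words:
            he_to_en.setdefault(he, []).append(en)
    return he_to_en

def combine(he_to_en, inverted):
    # Union the native Hebrew->English section with the inverted one,
    # keeping native translations first and avoiding duplicates.
    combined = {he: list(ens) for he, ens in he_to_en.items()}
    for he, ens in inverted.items():
        for en in ens:
            if en not in combined.setdefault(he, []):
                combined[he].append(en)
    return combined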

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

English generation differs from Spanish generation in our systems. Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
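A sketch of this enhancement step; the naive add-"s"/"ed" inflection below is a stand-in for however the actual English forms were produced, and the entry layout is simplified to (Hebrew, English, POS, constraints):

def enhance(hebrew, english, pos):
    # Expand one base-form entry into inflected variants with
    # unification constraints; the unconstrained base form comes first.
    entries = [(hebrew, english, pos, {})]
    if pos == "N":
        entries.append((hebrew, english + "s", pos, {"num": "pl"}))
    elif pos == "V":
        entries.append((hebrew, english + "ed", pos, {"tense": "past"}))
        entries.append((hebrew, "will " + english, pos, {"tense": "fut"}))
        entries.append((hebrew, english + "s", pos,
                        {"tense": "pres", "person": "3", "num": "sg"}))
        entries.append((hebrew, "to " + english, pos, {"form": "inf"}))
    return entries

print(enhance("$WRH", "line", "N"))
# [('$WRH', 'line', 'N', {}), ('$WRH', 'lines', 'N', {'num': 'pl'})]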

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
SL: $MLH ADWMH / TL: A RED DRESS (Score: 2)
NP1::NP1 : [NP1 ADJ] -> [ADJ NP1]
(
(X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1)
)

{NP1,3}
SL: H $MLWT H ADWMWT / TL: THE RED DRESSES (Score: 4)
NP1::NP1 : [NP1 H ADJ] -> [ADJ NP1]
(
(X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)

Figure 4: NP transfer rules for nouns modified by adjectives, from Hebrew to English.

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team, and who is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5: Select translated sentences from the development data.


System            BLEU                      NIST                      Precision  Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830     0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938     0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085     0.4241

Table 1: System performance results with the various grammars.

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and to establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
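The bootstrap procedure itself is straightforward; a sketch, with score_fn standing in for a corpus-level metric such as BLEU or NIST computed over a list of (hypothesis, references) pairs:

import random

def bootstrap_ci(test_set, score_fn, n_resamples=1000, alpha=0.05):
    # Resample the test set with replacement, recompute the corpus-level
    # score each time, and read a (1 - alpha) confidence interval off the
    # empirical distribution of resampled scores.
    scores = []
    for _ in range(n_resamples):
        resample = [random.choice(test_set) for _ in test_set]
        scores.append(score_fn(resample))
    scores.sort()
    low = scores[int((alpha / 2) * n_resamples)]
    high = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high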


2.11 Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language for which to build MT systems in a task designed to build MT systems for languages with few resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation, and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

1. INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,


and have signi_cantly improved the state-of-the-art of Machine Translation for a number of di_erent language pairs These approaches are attractive because they are fully auto-mated and require orders of magnitude less human labor than traditional rule-based MT approaches However to achieve reasonable levels of translation performance the corpus-based methods require very large volumes of sentence-aligned parallel text for the two lan-guages on the order of magnitude of a mil-lion words or more Such resources are only currently available for only a small number of language pairs While the amount of online re-sources for many languages will undoubtedly grow over time many of the languages spo-ken by smaller ethnic groups and populations in the world will not have such resources within the forseeable future Corpus-based MT approaches will therefore not be e_ective for such languages for some time to comeOur MT research group at Carnegie Mellon under DARPA and NSF funding has been working on a new MT approach that is speci_cally designed to enable rapid develop-ment of MT for languages with limited amounts of online resources Our approach assumes the availability of a small number of bilingual speakers of the two languages but these need not be linguistic experts The bi-lingual speakerscreate a comparatively small corpus of word aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool From this data the learning module of our system automatically infers hierarchical syntactic transfer rules which encode how constituent structures in the source language transfer to the target language The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language We re-fer to this system as the ldquoTrainableTransfer-based MT System or in short the Xfer systemThe DARPA-sponsored Surprise Language Ex-ercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach The Hindi-to-English system that we developed in the course of this exer-cise was the _rst large-scale open-domain test for our system Our goal was to compare the performance of our Xfer system with the cor-pus-based SMT and EBMT approaches devel-oped both within our group at Carnegie Mel-lon and by our colleagues elsewhere Com-mon training and testing data were used for the purpose of this cross-system comparison The data was collected throughout the SLE during the month of June 2003 As data re-

sources became available it became clear that the Hindi-to-English was in fact not a ldquolimited-data situation By the end of the SLE over 15 million words of parallel Hindi-English text had been collected which was su_cient for development of basic-quality SMT and EBMT systems In a common evaluation con-ducted at the end of the SLE the SMT systems that participated in the evaluation outper-formed our Xfer system as measured by the NIST automatic MT evaluation metric [Dod-dington 2003] Our system received a NIST score of 547 as compared with the best SMT system which received a score of 761Our intention however was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg1998 Sherematyeva and Nirenburg 2000 Jones and Havrilla 1998]) We therefore de-signed an ldquoarti_cial extremely limited data scenario where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE We then de-signed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario The design execution and re-sults of this experiment are the focus of this paper The results of the experiment indicate that under these extremely limited training data conditions when tested on unseen data the Xfer system signi_cantly outperforms both EBMT and SMTSeveral di_erent versions of the Xfer system were tested Results indicated that automati-cally learned transfer rules are e_ective in im-proving translation performance compared with a baseline word-to-word translation ver-sion of our system System performance with a limited number of manually written transfer rules was however still better than the cur-rent automatically inferred rules Furthermore aldquomulti-engine version of our system that combined the output of the Xfer and SMT sys-tems and optimizes translation selection out-performed both individual systemsThe remainder of this paper is organized as follows Section 2 presents an overview of the Xfer system and its components Section 3 de-scribes the elicited data collection for Hindi-English that we conducted during the SLE which provided the bulk of training data for our limited data experiment Section 4 de-scribes the speci_c resources and components that were incorporated into our Hindi-to-Eng-lish Xfer system Section 5 then describes the controlled experiment for comparing the Xfer EBMT and SMT systems under the limited data

scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1. Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish), for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e. English) into their native language (i.e. Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation

principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number, agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules, and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules, which then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from 'traditional' transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the 'right-hand sides' of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP : [V V Aux] -> [Aux being V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.
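Taken together, these components map naturally onto a small data structure. The sketch below (hypothetical Python; the class and field names are ours, not those of the actual Xfer implementation) transcribes the rule of Figure 3 into such a representation:

from dataclasses import dataclass, field

@dataclass
class TransferRule:
    sl_type: str                 # constituent type on the source (x) side, e.g. "VP"
    tl_type: str                 # constituent type on the target (y) side
    x_side: list                 # linear sequence of SL components
    y_side: list                 # linear sequence of TL components
    alignments: list             # (x_index, y_index) pairs, 1-based as in the formalism
    x_constraints: list = field(default_factory=list)   # checked at parse time
    y_constraints: list = field(default_factory=list)   # guide TL generation
    xy_constraints: list = field(default_factory=list)  # feature transfer SL -> TL

# The passive simple-present rule of Figure 3, transcribed by hand:
passive_present = TransferRule(
    sl_type="VP", tl_type="VP",
    x_side=["V", "V", "Aux"],
    y_side=["Aux", "being", "V"],
    alignments=[(1, 3)],
    x_constraints=[("x1", "form", "part"), ("x1", "aspect", "perf"),
                   ("x2", "form", "part"), ("x2", "aspect", "imperf"),
                   ("x2", "lexwx", "jAnA"), ("x3", "lexwx", "honA"),
                   ("x3", "tense", "pres")],
    y_constraints=[("y1", "lex", "be"), ("y1", "tense", "pres"),
                   ("y3", "form", "part")],
)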

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial 'guesses' at transfer rules. The rules that result from Seed Generation are 'flat', in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as Det N P and Det Adj N P can be re-written more generally as NP P, as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free—they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only

for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules, by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
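The flavor of the first learning stage can be conveyed with a small sketch. The hypothetical Python fragment below (illustrative only; function and field names are our own, not those of the actual learning module) reads a flat seed rule directly off one word-aligned, POS-tagged training pair:

def seed_rule(sl_tagged, tl_tagged, alignment):
    """Produce a flat seed transfer rule from one elicited example.

    sl_tagged, tl_tagged: lists of (word, part_of_speech) pairs
    alignment: set of (sl_index, tl_index) word-alignment pairs (1-based)
    """
    return {
        "type": ("S", "S"),                       # flat seed rules span the whole segment
        "x_side": [pos for _, pos in sl_tagged],  # SL side is just the POS sequence
        "y_side": [pos for _, pos in tl_tagged],
        "alignments": sorted(alignment),          # carried over from the elicitation tool
        "constraints": [],                        # added later from the example's features
    }

# Compositionality Learning would then replace sub-sequences of the flat rule
# (e.g. Det Adj N) with a non-terminal (NP) wherever a previously learned NP rule
# covers the corresponding words on both sides; Seeded Version Space Learning
# finally relaxes feature constraints that can be dropped without causing errors.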

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar—the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].
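A toy rendition of this parse/transfer/generate pipeline may help fix the idea. The self-contained Python sketch below is our own drastically simplified stand-in, not the actual chart-based system of [Peterson 2002]: rules are flat POS sequences, "parsing" is span matching, and "generation" is slot permutation (the postposition-inversion rule used here anticipates Figure 5):

def translate(sl_words, lexicon, rules):
    """Toy transfer: x-side matching, transfer, and generation into a lattice.

    lexicon: dict mapping an SL word to a list of TL translations
    rules:   list of (x_side, y_order) pairs; x_side is a tuple of SL POS
             tags and y_order gives the output permutation of those slots
    Returns a lattice: list of (start, end, tl_string) edges.
    """
    # toy POS lookup so the example is self-contained (illustrative values)
    pos = {"jIvana": "N", "ke": "Postp", "eka": "Num", "aXyAya": "N"}
    lattice = []

    # word-to-word lexical transfer seeds the lattice
    for i, w in enumerate(sl_words):
        for t in lexicon.get(w, []):
            lattice.append((i, i + 1, t))

    # x-side "parsing": match each rule's POS sequence against every span,
    # then "transfer/generate" by permuting the translated slots per y_order
    for x_side, y_order in rules:
        n = len(x_side)
        for start in range(len(sl_words) - n + 1):
            span = sl_words[start:start + n]
            if tuple(pos.get(w) for w in span) == x_side:
                slots = [lexicon[w][0] for w in span if lexicon.get(w)]
                out = " ".join(slots[j] for j in y_order if j < len(slots))
                lattice.append((start, start + n, out))
    return lattice

lexicon = {"jIvana": ["life"], "ke": ["of"], "eka": ["one"], "aXyAya": ["chapter"]}
rules = [(("N", "Postp"), (1, 0))]       # [NP Postp] -> [Prep NP] inversion
print(translate(["jIvana", "ke", "eka", "aXyAya"], lexicon, rules))
# edges include (0, 2, 'of life') alongside the word-to-word edges

The real system, of course, builds full constituent structures in a chart and applies unification constraints; the point here is only the shape of the output: overlapping, variously sized lattice edges handed to the decoder.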

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to the Bayes decision rule, we have to search for

\hat{e} = \arg\max_e p(e|f) = \arg\max_e p(f|e)\, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase, and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_{s} p(\tilde{f} \mid \tilde{e}) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i})    (4)

The search algorithm considers all possible sequences s in the lattice, and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence, i.e. leaving a gap, translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution, i.e. longer movements are less likely than shorter ones, with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search, i.e. partial translation hypotheses which are low scoring, compared to the best scoring hypothesis up to that point, are pruned from further consideration.
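A stripped-down version of this search (monotone, without the reordering window, and keeping a single best hypothesis per covered prefix where the real decoder maintains a beam) can be written as a dynamic program over lattice edges. This is a sketch under our own assumptions, not the actual decoder; score_fn-style callables stand in for the trained models:

import math

def decode(lattice, sentence_len, lm_logprob, tm_logprob):
    """Pick a high-scoring sequence of adjoining, non-overlapping units.

    lattice: list of (start, end, tl_phrase) edges over the source sentence
    lm_logprob(history, word): log p(word | up to two previous words)
    tm_logprob(start, end, tl_phrase): log translation probability of the unit
    """
    # best[i] = (score, translation words) for the best cover of f_1 .. f_i
    best = {0: (0.0, [])}
    for end in range(1, sentence_len + 1):
        for (s, e, phrase) in lattice:
            if e != end or s not in best:
                continue
            score, words = best[s]
            new_words = words + phrase.split()
            new_score = score + tm_logprob(s, e, phrase)
            for k in range(len(words), len(new_words)):   # LM over added words
                new_score += lm_logprob(new_words[max(0, k - 2):k], new_words[k])
            if end not in best or new_score > best[end][0]:
                best[end] = (new_score, new_words)
    return best.get(sentence_len, (-math.inf, []))[1]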

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a

compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way, we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our

trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural, even for native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description            Morpher          Xfer
past participle        (tam = yA)       (aspect = perf) (form = part)
present participle     (tam = wA)       (aspect = imperf) (form = part)
infinitive             (tam = nA)       (form = inf)
future                 (tam = future)   (tense = fut)
subjunctive            (tam = subj)     (tense = subj)
root                   (tam = 0)        (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
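The feature mapping of Table I is essentially a lookup table. The following sketch shows the conversion step in Python; the mapping values come directly from Table I, while the function name, output format, and example word are our own illustrative assumptions:

# tam ("tense-aspect-mood") value reported by Morpher -> Xfer feature structure
TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},   # past participle
    "wA":     {"aspect": "imperf", "form": "part"},   # present participle
    "nA":     {"form": "inf"},                        # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def morpher_to_xfer(root, morpher_features):
    """Convert raw Morpher output features into the Xfer feature system."""
    xfer = {"root": root}
    for name, value in morpher_features.items():
        if name == "tam":
            xfer.update(TAM_TO_XFER.get(value, {}))
        else:
            xfer[name] = value      # e.g. gender and number pass through
    return xfer

# hypothetical Morpher output for a perfective participle:
print(morpher_to_xfer("bheja", {"tam": "yA", "num": "pl"}))
# -> {'root': 'bheja', 'aspect': 'perf', 'form': 'part', 'num': 'pl'}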

4.1.2 Manually Written Grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed

grammars can be developed, and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense/aspect/mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion: jIvana (life) ke (of) eka (one) aXyAya (chapter)
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion: one chapter of life
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically Learned Grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
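The frequency-based scoring and pruning step is simple to state concretely. A sketch in hypothetical Python (the threshold value is an illustrative assumption, not the one used in the experiments):

from collections import Counter

def score_and_prune(learned_rules, min_probability=0.01):
    """Assign each distinct rule a probability proportional to how many
    training examples produced it, then drop low-probability rules."""
    counts = Counter(learned_rules)          # identical learned rules collapse here
    total = sum(counts.values())
    scored = {rule: n / total for rule, n in counts.items()}
    return {rule: p for rule, p in scored.items() if p >= min_probability}

With the 327 distinct learned rules, a cutoff of this kind was what reduced the grammar to the 16 rules used in the pruned configuration.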

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word, on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words, and leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we 'enhanced' the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary, and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged, and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule, indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as a future tense entry. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.
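The enhancement-plus-unification scheme can be sketched compactly. In the hypothetical Python below, all names are illustrative, the spelling rule is deliberately simplified compared to the real ones, and marking the base entry as singular is our simplifying assumption; a root-form dictionary entry is expanded with a constrained plural entry, and feature matching then selects among the candidates:

def pluralize(english_noun):
    """Very simplified English spelling rule for plural formation."""
    if english_noun.endswith(("s", "sh", "ch", "x", "z")):
        return english_noun + "es"
    return english_noun + "s"

def enhance_noun_entry(hindi_root, english_root):
    """One root-form dictionary entry -> base entry + constrained plural entry."""
    return [
        (hindi_root, english_root, {"num": "sg"}),             # base form (assumed sg)
        (hindi_root, pluralize(english_root), {"num": "pl"}),  # added, with constraint
    ]

def matching_entries(lexicon, hindi_root, input_features):
    """Keep candidate entries whose constraints agree with the input features."""
    return [english for h, english, constraints in lexicon
            if h == hindi_root
            and all(input_features.get(f) == v for f, v in constraints.items())]

lexicon = enhance_noun_entry("patr", "letter")            # hypothetical entry
print(matching_entries(lexicon, "patr", {"num": "pl"}))   # -> ['letters']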

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans, and what the potential English translation is. The decoder then rescores these entries, and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full, inflected form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module, to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations, without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
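Put together, the three passes amount to a union of lattice edges. A schematic rendering in Python (a sketch under our own assumptions; the callables and data shapes are illustrative, not the actual engine's interfaces):

def build_lattice(words, phrasal_lexicon, root_lexicon, analyze, grammar):
    """Three-pass lattice production (schematic).

    phrasal_lexicon: maps full-form Hindi phrases (tuples of words) to English
    root_lexicon:    maps inflected Hindi words to English translations
    analyze(word):   morphological analysis, returns (root, features)
    grammar(spans):  applies transfer rules to analyzed words, yields edges
    """
    lattice = []

    # Pass 1: full-form phrasal matches; no morphology, no grammar rules
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            phrase = tuple(words[i:j])
            if phrase in phrasal_lexicon:
                lattice.append((i, j, phrasal_lexicon[phrase]))

    # Pass 2: morphological analysis feeds the grammar rules (and the
    # root-form lexicon enhancements of Section 4.2)
    analyzed = [analyze(w) for w in words]
    lattice.extend(grammar(analyzed))

    # Pass 3: word-to-word matches of the original inflected forms
    for i, w in enumerate(words):
        if w in root_lexicon:
            lattice.append((i, i + 1, root_lexicon[w]))

    return lattice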

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of

bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                 BLEU                NIST
EBMT                   0.058               4.22
SMT                    0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar        0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar   0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar    0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer + SMT)      0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

[Figure: NIST scores as a function of the decoder reordering window (0-4), for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig. 9. Results by NIST score

[Figure: BLEU scores as a function of the decoder reordering window (0-4), for the same five systems]

Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources, and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is

performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output, if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the Bleu score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and Bleu scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
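This bootstrap procedure is straightforward to reproduce. A sketch in Python (hypothetical; score_fn stands in for the NIST or Bleu scorer, which is not shown):

import random

def bootstrap_ci(test_sentences, score_fn, n_resamples=10000, alpha=0.05):
    """Median and (1 - alpha) confidence interval via bootstrap resampling.

    test_sentences: list of (hypothesis, references) items for one system
    score_fn: corpus-level metric over such a list (e.g. NIST or Bleu)
    """
    scores = []
    n = len(test_sentences)
    for _ in range(n_resamples):
        sample = [random.choice(test_sentences) for _ in range(n)]  # with replacement
        scores.append(score_fn(sample))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return scores[n_resamples // 2], (lo, hi)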

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for Bleu), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios, where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over" point between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology, token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned

grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice, and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated, rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India that conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language resource. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project

has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools, to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool, to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translation produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed

over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set, and an examination of actual translations of development data, indicate that the current system is already effective

enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases, and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond Mapudungun, Quechua, Hindi, and Hebrew, we have also worked on Dutch, and are beginning work on Inupiaq, Aymara, Brazilian Portuguese, and indigenous languages of Brazil.

4. Acknowledgements

Many people beyond the authors of this paper

contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullán Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James. (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5, no. 1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramática Quechua, Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association for Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine translation systems for indigenous languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish user studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pages 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullán. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD Thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown and Lori Levin (2002). Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya (1997). Hebrew–English English–Hebrew Dictionary. Academon, Jerusalem.

Doddington, George (2002). Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani (1986). Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science, 1:54–77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, Philadelphia, PA, July.

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie and Jaime Carbonell (2002). MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel (1999). Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly (2004). Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113–138.

Wintner, Shuly and Shlomo Yona (2003). Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


information from non-expert bilingual speakers (Font Llitjós et al., 2005a).

For this language pair, translation and morphology lexicons were automatically created from data annotated by a native speaker and were later manually revised. Taking the Elicitation Corpus (section N) as a reference, we developed a small translation grammar manually.

2.3.1 Text Corpora

As part of the data collected for Quechua, the AVENUE Elicitation Corpora (EC) were translated and manually aligned by both a native Quechua speaker and a linguist with good knowledge of Quechua. The EC is used when there is no natural corpus large enough for development of MT. The EC resembles a fieldwork questionnaire containing simple sentences that elicit specific meanings and structures. It has two parts. The first part, the Functional Elicitation Corpus, contains sentences designed to elicit functional/communicative features such as number, person, tense and gender. The version that was used in Peru had 1,700 sentences. The second part, the Structural Elicitation Corpus, is a smaller corpus designed to cover the major structures present in the Penn Treebank (Marcus et al., 1992). Out of 122,176 sentences from the Brown Corpus section of the Penn Treebank, 222 different basic structures and substructures were extracted, namely 25 AdvPs, 47 AdjPs, 64 NPs, 13 PPs, 23 SBARs and 50 Ss. For more information about how this corpus was created and what its properties are, see Probst and Lavie (2004). The final Structural Elicitation Corpus, which was translated into Quechua, had 146 Spanish sentences.

Besides the Elicitation Corpora, there was no other Quechua text readily available in electronic format, and thus three books with parallel text in Spanish and Quechua were scanned: Cuentos Cusqueños, Cuentos de Urubamba and Gregorio Condori Mamani. Quechua speakers examined the scanned Quechua text (360 pages) and corrected the optical character recognition (OCR) errors, using the original image of the text as a reference.

2.3.2 A Rule-Based MT Prototype

Similar to the Mapudungun-Spanish system, the Quechua-Spanish system also contains a Quechua morphological analyzer, which pre-processes the input sentences to split words into roots and suffixes. The lexicon and the rules are applied by the transfer engine, and finally the Spanish morphology generation module is called to inflect the corresponding Spanish stems with the relevant features (Section ).

2.3.2.1 Morphology and Translation Lexicons

In order to build a translation and morphology lexicon, the word types from the three Quechua books were extracted and ordered by frequency. The total number of types was 31,986 (Cuentos Cusqueños: 9,988; Cuentos de Urubamba: 12,223; Gregorio Condori Mamani: 12,979), with less than 10% overlap between books; only 3,002 word types appeared in more than one book.¹ Since 16,722 word types were only seen once in the books (singletons), we decided to segment and translate only the 10,000 most frequent words in the list, hoping to reduce the number of OCR errors and misspellings. Additionally, all the unique

1 This was done before the OCR correction was completed, and thus this list contained OCR errors.

word types from one of the versions of the Elicitation Corpora were also extracted (1,666 types) to ensure basic coverage.
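The frequency-based selection just described is simple to reproduce. The following Python sketch illustrates the procedure; the file names and the whitespace tokenization are illustrative assumptions, not the project's actual scripts:

from collections import Counter

# Hypothetical file names standing in for the three scanned books.
BOOKS = ["cuentos_cusquenos.txt", "cuentos_de_urubamba.txt",
         "gregorio_condori_mamani.txt"]

def word_type_frequencies(paths):
    # Count word types across all books (naive whitespace tokenization).
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.lower().split())
    return counts

counts = word_type_frequencies(BOOKS)
# Keep only the 10,000 most frequent types for segmentation and translation;
# singletons are more likely to be OCR errors or misspellings.
top_types = [w for w, _ in counts.most_common(10000)]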

These 10,000 words were segmented and translated by a native Quechua speaker. For each word, the native speaker was asked to provide information for all the fields listed in Figure 10.

1. Word Segmentation
2. Root Translation
3. Root Part-of-Speech
4. Word Translation
5. Word Part-of-Speech
6. Translation of the Final Root

Figure 10: Fields required to automatically build translation and morphology lexicons.

The native speaker was asked to provide the Translation of the Final Root (last field in Figure 10) only if there had been a part-of-speech (POS) change. The reason we needed this is that if the POS fields for the root and the word differ (fields 3 and 5), the translation of the final root might have changed, and thus the translation in the lexical entry actually needs to be different from the translation of the root. In Quechua this is important for words such as "machuyani" (I age/get older), where the root "machu" is an adjective meaning "old" and the word is a verb whose root really means "to get old" ("machuyay").² Instead of having a lexical entry like V-machuy-viejo (old), we are interested in having a lexical entry V-machu(ya)y-envejecer (to get old).

From the list of 10,000 segmented and translated words, a stem lexicon was automatically generated and manually corrected. For example, from the word type "chayqa" and the specifications given for all the other fields, as shown in Figure 11, six different lexical entries were automatically created, one for each POS and each alternative translation (Pron-ese, Pron-esa, Pron-eso, Adj-ese, Adj-esa, Adj-eso). In some cases, when the word has a different POS, it actually is translated differently in Spanish. For these cases, the native speaker was asked to use "||" instead of "|", and the post-processing scripts were designed to check for the consistency of "||" in both the translation and the POS fields.
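A minimal sketch of how this entry expansion could work for one annotated row is shown below; the field layout follows Figure 11, but the function and its parsing conventions are simplified stand-ins for the actual post-processing scripts:

def expand_entries(word, segmentation, pos_field, translation_field):
    # '||' separates readings whose POS and translation must be paired up;
    # '|' separates free alternatives within one reading.
    pos_groups = [g.strip() for g in pos_field.split("||")]
    trans_groups = [g.strip() for g in translation_field.split("||")]
    if len(pos_groups) != len(trans_groups):
        raise ValueError("Inconsistent use of '||' in POS and translation fields")
    entries = []
    for pos_g, trans_g in zip(pos_groups, trans_groups):
        for pos in (p.strip() for p in pos_g.split("|")):
            for trans in (t.strip() for t in trans_g.split("|")):
                entries.append((pos, segmentation, trans))
    return entries

# The 'chayqa' row from Figure 11 yields the six entries mentioned above:
# (Pron, ese), (Pron, esa), (Pron, eso), (Adj, ese), (Adj, esa), (Adj, eso).
print(expand_entries("chayqa", "chay+qa", "Pron | Adj", "ese | esa | eso"))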

2 -ya- is a verbalizer in Quechua


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word:             chayqa
Segmentation:     chay+qa
Root Translation: ese | esa | eso
Root POS:         Pron | Adj
Word Translation: ese || es ese
Word POS:         Pron || VP

Figure 11: Example of a segmented and translated word.

V | [ni] -> [decir] ((X1::Y1))

N | [pacha] -> [tiempo] ((X1::Y1))

N | [pacha] -> [tierra] ((X1::Y1))

Pron | [noqa] -> [yo] ((X1::Y1))

Interj | [alli] -> [a pesar] ((X1::Y1))

Adj | [hatun] -> [grande] ((X1::Y1))

Adj | [hatun] -> [alto] ((X1::Y1))

Adv | [kunan] -> [ahora] ((X1::Y1))

Adv | [allin] -> [bien] ((X1::Y1))

Adv | [ama] -> [no] ((X1::Y1))

Figure 12: Automatically generated lexical entries from the segmented and translated word list.

; "dicen que" on the Spanish side
Suff::Suff | [s] -> [] ((X1::Y1) ((x0 type) = reportative))

; when following a consonant
Suff::Suff | [si] -> [] ((X1::Y1) ((x0 type) = reportative))

Suff::Suff | [qa] -> [] ((X1::Y1) ((x0 type) = emph))

Suff::Suff | [chu] -> [] ((X1::Y1) ((x0 type) = interr))

VSuff::VSuff | [nki] -> [] ((X1::Y1) ((x0 person) = 2) ((x0 number) = sg) ((x0 mood) = ind) ((x0 tense) = pres) ((x0 inflected) = +))

NSuff::NSuff | [kuna] -> [] ((X1::Y1) ((x0 number) = pl))

NSuff::Prep | [manta] -> [de] ((X1::Y1) ((x0 form) = manta))

Figure 13: Manually written suffix lexical entries.

For the current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuamán's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in subsection 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in Section .

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and, when necessary, corrected nine machine translations. The correction of MT output is the first step towards automatic rule refinement and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.

{S,2}
S::S : [NP VP] -> [NP VP]
((X1::Y1)
 (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender)))

{SBar,1}
SBar::SBar : [S] -> ["Dice que" S]
((X1::Y2)
 ((x1 type) =c reportative))

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
((X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood)))

Figure 14: Manually written grammar rules for Quechua-Spanish translation.

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO))))>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ))))))>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO)))) (VP3 (VBAR2 (V35 SOY)) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA))))))>


Figure 15: MT output from the Quechua-Spanish rule-based MT prototype.

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.
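Root-and-pattern word formation can be illustrated with a few lines of Python; the root, the vowel patterns, and the "C" slot convention below are our own illustrative choices, not data from our system:

def interdigitate(root, pattern, slot="C"):
    # Insert the root's consonants, in order, into the slots of the pattern.
    consonants = iter(root)
    return "".join(next(consonants) if ch == slot else ch for ch in pattern)

# In the transliteration used in this paper, a root like G-D-L combines with
# different vowel patterns to yield different words:
print(interdigitate("GDL", "CaCoC"))  # GaDoL ("big")
print(interdigitate("GDL", "CiCeC"))  # GiDeL ("raised")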

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in its early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of the ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,³ not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"), prepositions such as B ("in"), K ("as"), L ("to") and M ("from"), subordinating conjunctions such as $ ("that") and K$ ("when"), relativizers such as $ ("that"), and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems.

3 To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.

Thus, a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or a prefixing particle.
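The resulting segmentation ambiguity can be made concrete with a sketch that enumerates possible prefix-particle splits; the particle list comes from the text above, while the recursion depth limit is an illustrative simplification (a real analyzer must also validate the remaining stem against its lexicon):

PARTICLES = ["H", "B", "K", "L", "M", "W", "$", "K$"]

def segmentations(form, max_prefixes=2):
    # The whole form may itself be a legitimate stem.
    results = [[form]]
    if max_prefixes == 0:
        return results
    for p in PARTICLES:
        if form.startswith(p) and len(form) > len(p):
            for rest in segmentations(form[len(p):], max_prefixes - 1):
                results.append([p] + rest)
    return results

# MHGR -> [['MHGR'], ['M', 'HGR'], ['M', 'H', 'GR']], matching the three
# readings described above.
print(segmentations("MHGR"))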

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus, the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges of orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule-based, relying on the AVENUE run-time transfer engine described in section  to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor-effort equivalent of about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. This small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing publicly available resources, which we adapted in novel ways for the MT task, to directly address the major issues of lexical, morphological and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation


segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next subsection) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
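A sketch of this mapping step is shown below, assuming the input has already been decoded from its Windows code page; the lookup table is deliberately partial and our own reconstruction (only letters whose romanization is quoted in this paper are included), and the handling of final letter forms is omitted:

# Partial map from Hebrew letters to the ASCII romanization used here;
# the correspondences ($ for shin, etc.) follow this paper's examples.
HEB2ASCII = {
    "\u05d1": "B",  # bet
    "\u05d4": "H",  # he
    "\u05d5": "W",  # vav
    "\u05db": "K",  # kaf
    "\u05dc": "L",  # lamed
    "\u05de": "M",  # mem
    "\u05e8": "R",  # resh
    "\u05e9": "$",  # shin
}

def romanize(text):
    # Map each Hebrew character to its transliteration; punctuation and
    # whitespace pass through unchanged.
    return "".join(HEB2ASCII.get(ch, ch) for ch in text)

# Input in Microsoft Windows encoding would first be decoded, e.g.:
# text = raw_bytes.decode("cp1255")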

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features such as number, gender, person and tense. The analyzer also identifies prefix particles that are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew was thus available, it is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure, which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which will all be stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice representation of a set of analyses for the Hebrew word B$WRH.

Y0 ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))

Y1 ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))

Y2 ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))

Y3 ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))

Y4 ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))

Y5 ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))

Y6 ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))

Y7 ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-structure representation of a set of analyses for the Hebrew word B$WRH.
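The representation in Figures 2 and 3 translates naturally into a small data structure. In the Python sketch below (the names are ours, not the system's), each analysis is an edge annotated with its span, and any left-to-right walk over contiguous spans reconstructs one analysis sequence:

from dataclasses import dataclass, field

@dataclass
class Edge:
    spanstart: int
    spanend: int
    lex: str
    feats: dict = field(default_factory=dict)

# The analyses of B$WRH from Figures 2 and 3 as lattice edges; the clitic
# H is analyzed as the possessive lexeme $LH, as in Figure 3.
edges = [
    Edge(0, 4, "B$WRH", {"POS": "N", "GEN": "F", "NUM": "S"}),
    Edge(0, 2, "B", {"POS": "PREP"}),
    Edge(2, 4, "$WRH", {"POS": "N", "GEN": "F", "NUM": "S"}),
    Edge(0, 1, "B", {"POS": "PREP"}),
    Edge(1, 2, "H", {"POS": "DET"}),
    Edge(1, 3, "$WR", {"POS": "N", "GEN": "M", "NUM": "S"}),
    Edge(3, 4, "$LH", {"POS": "POSS"}),
]

def paths(edges, start, end):
    # Enumerate all contiguous edge sequences covering [start, end).
    if start == end:
        return [[]]
    out = []
    for e in edges:
        if e.spanstart == start and e.spanend <= end:
            out.extend([e] + rest for rest in paths(edges, e.spanend, end))
    return out

for p in paths(edges, 0, 4):
    print(" ".join(e.lex for e in p))  # prints the four analyses of Figure 2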

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice, rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person


feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed from the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

Unlike our Spanish-targeted systems, which use a morphological generation module to inflect Spanish citation forms, the Hebrew-English transfer system currently does not use a morphological generator on the target (English) side. Instead, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these forms should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular) and infinitive forms. The result is a set of lexical entries associating Hebrew base forms with English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
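A sketch of this dictionary enhancement for nouns is given below; the pluralizer and the entry format are illustrative simplifications (the real process also covers verb tenses and would need irregular forms):

def pluralize(noun):
    # Naive English pluralizer; real data needs irregulars and more rules.
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"
    return noun + "s"

def enhance_noun_entry(hebrew, english):
    # From one noun entry, produce singular and plural lexical rules, each
    # carrying the number constraint that run-time unification checks.
    return [
        {"sl": hebrew, "tl": english, "constraints": {("y1", "num"): "sg"}},
        {"sl": hebrew, "tl": pluralize(english), "constraints": {("y1", "num"): "pl"}},
    ]

# An entry pairing $WRH with "line" also yields "lines", marked plural.
print(enhance_noun_entry("$WRH", "line"))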

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
SL: $MLH ADWMH
TL: A RED DRESS
Score: 2

NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
SL: H $MLWT H ADWMWT
TL: THE RED DRESSES
Score: 4

NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Figure 4: NP transfer rules for nouns modified by adjectives, from Hebrew to English.

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 4.

As part of our two-month effort, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 4, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.
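The agreement constraints in Figure 4 amount to feature equations that either fix a value or require two constituents to share one. The Python sketch below checks such equations against feature structures; it is a stand-in for full unification, which would also merge and propagate values rather than merely test them:

def satisfies(fs, constraints):
    # Each constraint maps (slot, feature) either to a required value or to
    # another (slot, feature) whose value must match.
    for (slot, feat), expected in constraints.items():
        value = fs[slot].get(feat)
        if isinstance(expected, tuple):
            other_slot, other_feat = expected
            if value != fs[other_slot].get(other_feat):
                return False
        elif value != expected:
            return False
    return True

# $MLH ADWMH ("a red dress"): indefinite noun with an agreeing adjective.
fs = {"X1": {"def": "-", "status": "absolute", "num": "s", "gen": "f"},
      "X2": {"num": "s", "gen": "f"}}
rule_np1_2 = {("X1", "def"): "-",
              ("X1", "status"): "absolute",
              ("X1", "num"): ("X2", "num"),
              ("X1", "gen"): ("X2", "gen")}
print(satisfies(fs, rule_np1_2))  # True: rule {NP1,2} applies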

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5: Select translated sentences from the development data.


System           BLEU                     NIST                     Precision  Recall
No Grammar       0.0606 [0.0599, 0.0612]  3.4176 [3.4080, 3.4272]  0.3830     0.4153
Learned Grammar  0.0775 [0.0768, 0.0782]  3.5397 [3.5296, 3.5498]  0.3938     0.4219
Manual Grammar   0.1013 [0.1004, 0.1021]  3.7850 [3.7733, 3.7966]  0.4085     0.4241

Table 1: System performance results with the various grammars.

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted at translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
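The bootstrapping step can be sketched as follows; the resampling is standard, while the metric argument is a placeholder, since BLEU and NIST are corpus-level scores that must be recomputed on each resampled test set:

import random

def bootstrap_ci(test_items, metric, n_resamples=1000, alpha=0.05):
    # Percentile-bootstrap confidence interval for a corpus-level metric.
    # test_items: list of (hypothesis, references) pairs for the test set.
    # metric: function mapping such a list to a single corpus-level score.
    scores = []
    for _ in range(n_resamples):
        resample = [random.choice(test_items) for _ in test_items]
        scores.append(metric(resample))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi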


2.11 Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language for a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al 1990; Brown et al 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years


and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or in short the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data

resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT and SMT systems under the limited data

scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1: Architecture of the Xfer MT system and its major components.

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

21 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e. English) into their native language (i.e. Hindi) and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process and the interface tool used for elicitation can be found in [Probst et al 2001; Probst and Levin 2002].

22 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows the SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

; PASSIVE SIMPLE PRESENT
VP::VP : [V V Aux] -> [Aux "being" V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Fig. 3: A transfer rule for present tense verb sequences in the passive voice.

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr DAk se bheje jAte h~ai ("Letters are still being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat" in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be rewritten more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only

for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al 2003].
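The compositionality stage can be pictured as rewriting a learned flat sequence using already-learned constituent rules. The toy Python sketch below (rule inventory and data structures are our own illustrations) reproduces the "Det N P" to "NP P" generalization mentioned above:

def generalize(flat_rule, learned):
    # Replace any contiguous sub-sequence matching the right-hand side of a
    # previously learned constituent rule with that constituent's label.
    seq = list(flat_rule)
    changed = True
    while changed:
        changed = False
        for label, rhs in learned:
            n = len(rhs)
            for i in range(len(seq) - n + 1):
                if seq[i:i + n] == rhs:
                    seq[i:i + n] = [label]
                    changed = True
                    break
            if changed:
                break
    return seq

learned_rules = [("NP", ["Det", "N"]), ("NP", ["Det", "Adj", "N"])]
print(generalize(["Det", "N", "P"], learned_rules))         # ['NP', 'P']
print(generalize(["Det", "Adj", "N", "P"], learned_rules))  # ['NP', 'P']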

23 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The run-time translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the run-time transfer-based translation sub-system can be found in [Peterson 2002].

24 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability p(e | f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e) \, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability p(f | e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_s p(\tilde{f}_s \mid \tilde{e}_s) = \prod_s \prod_j \sum_i p(f_{s_j} \mid e_{s_i})   (4)

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring compared to the best scoring hypothesis up to that point are pruned from further consideration.
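To make the scoring concrete, the following Python sketch (ours, not the actual decoder implementation) shows how a candidate sequence of translation units could be scored under Equations (1)-(4), assuming the IBM-1 lexicon and the trigram language model are given as plain dictionaries:

import math

def phrase_logprob(f_words, e_words, t):
    # IBM-1 style phrase score (Eq. 3): product over source words of
    # the sum over target words of t[f][e].
    logp = 0.0
    for f in f_words:
        s = sum(t.get(f, {}).get(e, 1e-10) for e in e_words)
        logp += math.log(max(s, 1e-10))
    return logp

def trigram_logprob(e_words, lm):
    # Trigram LM score (Eq. 2); lm maps (e_{i-2}, e_{i-1}, e_i) to p.
    padded = ["<s>", "<s>"] + e_words
    return sum(math.log(lm.get(tuple(padded[i-2:i+1]), 1e-10))
               for i in range(2, len(padded)))

def score_sequence(units, t, lm):
    # Score a sequence of adjoining, non-overlapping translation units:
    # log p(f|e) + log p(e), as in Equations (1) and (4).
    tm = sum(phrase_logprob(u["src"], u["tgt"], t) for u in units)
    e_words = [w for u in units for w in u["tgt"]]
    return tm + trigram_logprob(e_words, lm)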

3. ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus ([Fra]) that we extracted from the Penn Treebank ([Tre]).

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our

trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translations.
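Returning to the phrase extraction from the Brown Corpus, the procedure can be sketched as follows with NLTK's freely available Penn Treebank sample; the exact treebank sections, sorting criteria, and batching used in the actual collection effort were as described above, so this sketch is illustrative only:

from nltk.corpus import treebank  # requires nltk.download("treebank")

phrases = []
for tree in treebank.parsed_sents():
    for sub in tree.subtrees(lambda t: t.label() in ("NP", "PP")):
        # height() approximates the parse-tree depth used to sort
        # phrases from simplest to most complex
        phrases.append((sub.height(), sub.label(), " ".join(sub.leaves())))

phrases.sort(key=lambda p: p[0])
# divide into files of about 200 phrases each for the translators
batches = [phrases[i:i + 200] for i in range(0, len(phrases), 200)]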

4. HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description         | Morpher        | Xfer
past participle     | (tam = yA)     | (aspect = perf) (form = part)
present participle  | (tam = wA)     | (aspect = imperf) (form = part)
infinitive          | (tam = nA)     | (form = inf)
future              | (tam = future) | (tense = fut)
subjunctive         | (tam = subj)   | (tense = subj)
root                | (tam = 0)      | (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
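The wrapper interface can be sketched as follows; the server host and port are hypothetical placeholders, and the UTF-8-to-WX mapping shown is a tiny illustrative subset of the full Devanagari conversion table:

import socket

UTF8_TO_WX = {"\u0930": "r", "\u0939": "h", "\u093e": "A"}  # illustrative subset
WX_TO_UTF8 = {v: k for k, v in UTF8_TO_WX.items()}

def to_wx(text):
    return "".join(UTF8_TO_WX.get(ch, ch) for ch in text)

def from_wx(text):
    return "".join(WX_TO_UTF8.get(ch, ch) for ch in text)

def analyze(word, host="localhost", port=5000):  # hypothetical server address
    # Convert UTF-8 Hindi to Roman-WX, query the Morpher server,
    # and convert the analysis back to UTF-8.
    with socket.create_connection((host, port)) as conn:
        conn.sendall((to_wx(word) + "\n").encode("ascii"))
        reply = conn.makefile().readline()
    return from_wx(reply.strip())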

4.1.2 Manually Written Grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go; to indicate completion), lenA (take; to indicate an action done for one's own benefit), or denA (give; to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion: (jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion: (one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically Learned Grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
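A minimal sketch of the frequency-based pruning just described; the threshold value here is illustrative, not the one used in the experiment:

from collections import Counter

def prune_rules(rule_instances, min_prob=0.01):
    # rule_instances: one entry per training example that produced a rule.
    # Keep only rules whose relative frequency clears the threshold,
    # since low-probability rules mostly fire where they should not apply.
    counts = Counter(rule_instances)
    total = sum(counts.values())
    return {rule: n / total
            for rule, n in counts.items() if n / total >= min_prob}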

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words, leaving only those that were the best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we enhanced the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for the past tense, past participle, gerund (and continuous), as well as future tense. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules (a sketch of such rules follows the list below). Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for the plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
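As promised above, here is a sketch of the kind of spelling rules used to generate inflected English entries; this is our illustration only, since the actual rule set was more extensive and was supplemented by a list of irregular verbs:

import re

def pluralize(noun):
    # add 'es' after sibilants, turn consonant+y into 'ies', else add 's'
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"        # match -> matches
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"  # city -> cities
    return noun + "s"

def gerund(verb):
    # reduplicate a final consonant after a short vowel; drop silent 'e'
    if re.search(r"[^aeiou][aeiou][^aeiouwxy]$", verb):
        return verb + verb[-1] + "ing"  # stop -> stopping
    if verb.endswith("e") and not verb.endswith("ee"):
        return verb[:-1] + "ing"        # make -> making
    return verb + "ing"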

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2. Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
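The three passes can be summarized schematically as below; the grammar.apply interface is a hypothetical stand-in for the transfer engine:

def build_lattice(words, lexicon, morpher, grammar):
    # Each lattice arc records (start, end, english) for the decoder.
    arcs = []
    # Pass 1: match full inflected forms so phrasal entries can fire;
    # no grammar rules are applied.
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            for eng in lexicon.get(" ".join(words[i:j]), []):
                arcs.append((i, j, eng))
    # Pass 2: analyze each word into (root, features) and feed the
    # root forms into the transfer-grammar rules.
    roots = [morpher(w) for w in words]
    arcs.extend(grammar.apply(roots))
    # Pass 3: word-to-word translations of the original inflected
    # forms, without grammar rules, to maximize lattice coverage.
    for i, w in enumerate(words):
        for eng in lexicon.get(w, []):
            arcs.append((i, i + 1, eng))
    return arcs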

5. THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group, as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                | BLEU              | NIST
EBMT                  | 0.058             | 4.22
SMT                   | 0.102 (+/- 0.016) | 4.70 (+/- 0.20)
Xfer, no grammar      | 0.109 (+/- 0.015) | 5.29 (+/- 0.19)
Xfer, learned grammar | 0.112 (+/- 0.016) | 5.32 (+/- 0.19)
Xfer, manual grammar  | 0.135 (+/- 0.018) | 5.59 (+/- 0.20)
MEMT (Xfer + SMT)     | 0.136 (+/- 0.018) | 5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

[Plot: NIST score (4.4-6.0) vs. reordering window (0-4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig. 9. Results by NIST score

[Plot: BLEU score (0.07-0.15) vs. reordering window (0-4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).

(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.

(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used bootstrap sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
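A sketch of this bootstrap procedure, where score_fn stands for the BLEU or NIST computation over a drawn sample:

import random
import statistics

def bootstrap_ci(test_set, score_fn, rounds=10000):
    # Draw |test_set| sentences with replacement, rescore, and report
    # the median and a 95% confidence interval.
    scores = sorted(score_fn(random.choices(test_set, k=len(test_set)))
                    for _ in range(rounds))
    return (statistics.median(scores),
            scores[int(0.025 * rounds)],
            scores[int(0.975 * rounds)])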

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different reordering windows in the decoder on the different systems. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6. CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated, rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html

The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi

Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm

The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html

Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.

Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.

Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.

Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.

Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).

Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.

Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).

Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.

Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.

Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.

Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).

Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.

Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.

Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).

Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).

Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.

Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translation produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond Hindi, Hebrew, Mapudungun, and Quechua, we have also worked on Dutch, and we are beginning work on Inupiaq, Aymara, and Brazilian Portuguese, as well as languages native to Brazil.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James. (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramatica Quechua. Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review 21(2):113-138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


The scripts allow for fast post-processing of thousands of words; however, manual checking is still required to make sure that no spurious lexical entries have been created. Some examples of automatically generated lexical entries are presented in Figure 12. Suffix lexical entries, however, were hand-crafted; see Figure 13.

Word             | chayqa
Segmentation     | chay+qa
Root translation | ese | esa | eso
Root POS         | Pron | Adj
Word translation | ese || es ese
Word POS         | Pron || VP

Figure 11. Example of a word segmented and translated

V | [ni] -> [decir]
((X1::Y1))

N | [pacha] -> [tiempo]
((X1::Y1))

N | [pacha] -> [tierra]
((X1::Y1))

Pron | [noqa] -> [yo]
((X1::Y1))

Interj | [alli] -> [a pesar]
((X1::Y1))

Adj | [hatun] -> [grande]
((X1::Y1))

Adj | [hatun] -> [alto]
((X1::Y1))

Adv | [kunan] -> [ahora]
((X1::Y1))

Adv | [allin] -> [bien]
((X1::Y1))

Adv | [ama] -> [no]
((X1::Y1))

Figure 12. Automatically generated lexical entries from the segmented and translated word list
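The generation step can be sketched as follows, emitting entries in the transfer formalism from simplified (POS, root, translation) triples; the real scripts also had to cope with multiple glosses and POS ambiguity, as in Figure 11:

def make_lexical_entries(rows):
    entries = []
    for pos, root, translation in rows:
        # one entry per (root, translation) pair, aligned word-to-word
        entries.append("%s | [%s] -> [%s]\n((X1::Y1))" % (pos, root, translation))
    return entries

print("\n\n".join(make_lexical_entries(
    [("N", "pacha", "tiempo"), ("N", "pacha", "tierra")])))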

; "dicen que" on the Spanish side
Suff::Suff | [s] -> []
((X1::Y1)
((x0 type) = reportative))

; when following a consonant
Suff::Suff | [si] -> []
((X1::Y1)
((x0 type) = reportative))

Suff::Suff | [qa] -> []
((X1::Y1)
((x0 type) = emph))

Suff::Suff | [chu] -> []
((X1::Y1)
((x0 type) = interr))

VSuff::VSuff | [nki] -> []
((X1::Y1)
((x0 person) = 2)
((x0 number) = sg)
((x0 mood) = ind)
((x0 tense) = pres)
((x0 inflected) = +))

NSuff::NSuff | [kuna] -> []
((X1::Y1)
((x0 number) = pl))

NSuff::Prep | [manta] -> [de]
((X1::Y1)
((x0 form) = manta))

Figure 13. Manually written suffix lexical entries

For the current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuaman's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in subsection 2.2.2.3.1 above, contains 25 rules, and it covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations as output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed in section

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, where three Quechua speakers with good knowledge of Spanish evaluated and corrected nine machine translations, when necessary. The correction of MT output is the first step towards automatic rule refinement, and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool, and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.


{S,2}
S::S : [NP VP] -> [NP VP]
((X1::Y1) (X2::Y2)
((x0 type) = (x2 type))
((y1 number) = (x1 number))
((y1 person) = (x1 person))
((y1 case) = nom)
; subj-v agreement
((y2 number) = (y1 number))
((y2 person) = (y1 person))
; subj-embedded Adj agreement
((y2 PredAdj number) = (y1 number))
((y2 PredAdj gender) = (y1 gender)))

{SBar,1}
SBar::SBar : [S] -> ["Dice que" S]
((X1::Y2)
((x1 type) =c reportative))

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
((X1::Y1)
((x0 person) = (x3 person))
((x0 number) = (x3 number))
((x2 mood) = (NOT ger))
((x3 inflected) =c +)
((x0 inflected) = +)
((x0 tense) = (x2 tense))
((y1 tense) = (x2 tense))
((y1 person) = (x3 person))
((y1 number) = (x3 number))
((y1 mood) = (x3 mood)))

Figure 14. Manually written grammar rules for Quechua-Spanish translation

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO)))))>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ)))))))>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO)))) (VP3 (VBAR2 (V35 SOY)) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA)))))))>


Figure 15. MT output from the Quechua-Spanish rule-based MT prototype

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population, and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants, and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive, and consists mostly of suffixes, but also prefixes and circumfixes.
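A toy illustration of this root-and-pattern interdigitation; the patterns and glosses are standard textbook examples, not drawn from our system:

def interdigitate(root, pattern):
    # Insert the root's consonants into the pattern's 'C' slots.
    consonants = list(root)
    return "".join(consonants.pop(0) if ch == "C" else ch for ch in pattern)

print(interdigitate("gdl", "CiCeC"))  # gidel ('raised, grew')
print(interdigitate("gdl", "CaCoC"))  # gadol ('big')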

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script,³ not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"); prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"); subordinating conjunctions such as $ ("that") and K$ ("when"); relativizers such as $ ("that"); and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus, a form such as MHGR can be read as the lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or of a prefixing particle.

³ To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.
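The resulting segmentation ambiguity can be illustrated with a small sketch that enumerates readings of an undotted token as prefix particles plus a stem, using a toy lexicon just large enough for the MHGR example:

PREFIX_PARTICLES = ["H", "B", "K", "L", "M", "$", "K$", "W"]

def segmentations(token, lexicon, prefix=()):
    # Enumerate all ways to read token as particles + known stem.
    results = []
    if token in lexicon:
        results.append(prefix + (token,))
    for p in PREFIX_PARTICLES:
        if token.startswith(p) and len(token) > len(p):
            results += segmentations(token[len(p):], lexicon, prefix + (p,))
    return results

# 'immigrant', 'from Hagar', and 'from the foreigner' readings:
print(segmentations("MHGR", {"MHGR", "HGR", "GR"}))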

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus, the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity, and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges over orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule-based, relying on the AVENUE run-time transfer engine described in section to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets, and in the manner these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor-effort equivalent of about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. The small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems, we used existing, publicly available resources, which we adapted in novel ways for the MT task, to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation

Carnegie Mellon University 181206
We should devote a section of this paper to describe at least at a high level how sets of rules are learned
Carnegie Mellon University 181206
Fix ref
Carnegie Mellon University 121206
This Hebrew work really is a good fit for this paper I will need to focus on tying the Hebrew work into the context of the other systems1313Hebrew is ambiguous unlike Mapudungun13 Just as for Mapudungun computational resources had to be created1313The only significant problem that I see is how to compare the evaluation of this Hebrew system to the evaluation of the other systems particularly Mapudungun Is evaluation yet another omnivorous piece where one type of evaluation is appropriate in some circumstances but another type of evaluation is appropriate in others

ments that together represent the most likely translation of the entire input sentence We now describe each of the components in more detail

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
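In code, this mapping is a simple character-level substitution. The following is a minimal sketch of such a pre-processing step in Python, assuming the input has already been decoded from its original encoding (e.g., the Windows Hebrew code page) into Unicode; the mapping table shown is an illustrative fragment, not the actual table used in our system.

# Minimal sketch of the Hebrew pre-processing step: map Hebrew
# characters to the romanized ASCII representation used internally.
# Only a few letters are shown; the real module covers the full
# alphabet, final letter forms, punctuation, and special characters.
ROMANIZATION = {
    "\u05d0": "A",  # alef
    "\u05d1": "B",  # bet
    "\u05d4": "H",  # he
    "\u05d5": "W",  # vav
    "\u05de": "M",  # mem
    "\u05e8": "R",  # resh
    "\u05e9": "$",  # shin
}

def romanize(hebrew_text: str) -> str:
    """Map a Hebrew string to its romanized (ASCII) equivalent,
    leaving punctuation and unknown characters unchanged."""
    return "".join(ROMANIZATION.get(ch, ch) for ch in hebrew_text)

# bet, shin, vav, resh, he -> prints "B$WRH"
print(romanize("\u05d1\u05e9\u05d5\u05e8\u05d4"))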

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew was readily available, it is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic H ("her") as a suffix. Such a form would yield four different sequences of lexeme tokens, which are all stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2 Lattice Representation of a set of Analyses for the Hebrew Word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3 Feature-Structure Representation of a set of Analyses for the Hebrew Word B$WRH
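The lattice itself is a lightweight data structure: a set of arcs keyed by span, each carrying a lexeme and its feature structure. The following sketch shows one plausible in-memory representation of the analyses of B$WRH from Figures 2 and 3; the Arc class and the use of plain dicts for feature structures are illustrative choices, not the actual implementation.

# Minimal sketch of the analysis lattice: each arc spans part of the
# input form and carries a lexeme plus its morpho-syntactic features.
from dataclasses import dataclass, field

@dataclass
class Arc:
    spanstart: int
    spanend: int
    lex: str
    features: dict = field(default_factory=dict)

# The four readings of B$WRH, one path per analysis:
lattice = [
    [Arc(0, 4, "B$WRH", {"POS": "N", "GEN": "F", "NUM": "S"})],
    [Arc(0, 1, "B", {"POS": "PREP"}),
     Arc(1, 4, "$WRH", {"POS": "N", "GEN": "F", "NUM": "S"})],
    [Arc(0, 1, "B", {"POS": "PREP"}),
     Arc(1, 2, "H", {"POS": "DET"}),
     Arc(2, 4, "$WRH", {"POS": "N", "GEN": "F", "NUM": "S"})],
    [Arc(0, 1, "B", {"POS": "PREP"}),
     Arc(1, 3, "$WR", {"POS": "N", "GEN": "M", "NUM": "S"}),
     Arc(3, 4, "$LH", {"POS": "POSS"})],
]

for path in lattice:
    print(" ".join(arc.lex for arc in path))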

While the analyzer of Segal (Segal 1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice, rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan 1997), whose main benefit is that we were able to obtain it in machine readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
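The enhancement process is straightforward to picture in code. Below is a minimal sketch for the noun case, with a toy pluralize function standing in for the full set of English spelling rules; the entry format is hypothetical.

# Sketch of the dictionary enhancement process: for each English noun
# entry, add a plural entry (Hebrew side unchanged) carrying a plural
# number constraint; at run time the analyzed Hebrew features unify
# with these constraints, so only one form survives.

def pluralize(noun: str) -> str:
    # Toy stand-in for the real English spelling rules.
    if noun.endswith(("s", "sh", "ch", "x")):
        return noun + "es"
    return noun + "s"

def enhance_noun_entry(entry: dict) -> list:
    singular = dict(entry, constraints={"num": "s"})
    plural = dict(entry,
                  english=pluralize(entry["english"]),
                  constraints={"num": "pl"})
    return [singular, plural]

entry = {"hebrew": "$WRH", "english": "line", "pos": "N"}
for e in enhance_noun_entry(entry):
    print(e)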

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
;; SL: $MLH ADWMH
;; TL: A RED DRESS
;; Score: 2
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
;; SL: H $MLWT H ADWMWT
;; TL: THE RED DRESSES
;; Score: 4
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Figure 4: NP Transfer Rules for Nouns Modified by Adjectives from Hebrew to English

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al. 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 4.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar that reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 4, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team, and who is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5: Select Translated Sentences from the Development Data


System            BLEU                      NIST                      Precision   Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830      0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938      0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085      0.4241

Table 1: System Performance Results with the Various Grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development, we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al. 2002) and NIST (Doddington 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we applied a commonly used bootstrapping technique (Efron and Tibshirani 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual grammar system outperforms the no-grammar system according to all the metrics. The learned grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
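The bootstrapping procedure itself is only a few lines of code. The sketch below assumes a generic score function over a list of test items (standing in for BLEU, NIST, or unigram precision/recall); the toy data is illustrative only.

# Sketch of bootstrap confidence estimation over a test set: resample
# the test sentences with replacement many times, re-score each sample,
# and read the confidence interval off the empirical distribution.
import random

def bootstrap_ci(test_items, score_fn, trials=1000, alpha=0.05):
    n = len(test_items)
    scores = []
    for _ in range(trials):
        sample = [random.choice(test_items) for _ in range(n)]
        scores.append(score_fn(sample))
    scores.sort()
    lo = scores[int((alpha / 2) * trials)]
    hi = scores[int((1 - alpha / 2) * trials) - 1]
    return lo, hi

# Toy usage: the "score" is just the average of per-sentence values.
items = [0.3, 0.5, 0.2, 0.7, 0.4, 0.6]
print(bootstrap_ci(items, lambda s: sum(s) / len(s)))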


2.11 Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language to build MT systems for, in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than that of the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that the Hindi-to-English task was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1: Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish), for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number, agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from 'traditional' transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule; a minimal data-structure sketch in code follows the list. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the 'right-hand sides' of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

;; PASSIVE OF SIMPLE PRESENT
VP::VP : [V V Aux] -> [Aux "being" V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Fig. 3: A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.
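The rule components listed above map naturally onto a small data structure. The sketch below transcribes the rule of Fig. 3 into such a structure; the class layout is an illustrative assumption, and constraints are kept as strings rather than parsed feature tests.

# Sketch of a transfer rule as a data structure mirroring the component
# list above: type information, SL/TL right-hand sides, alignments, and
# the three kinds of constraints.
from dataclasses import dataclass, field

@dataclass
class TransferRule:
    sl_type: str                 # e.g. "VP"
    tl_type: str                 # e.g. "VP"
    sl_rhs: list                 # SL component sequence
    tl_rhs: list                 # TL component sequence
    alignments: list             # (x_index, y_index) pairs, 1-based
    x_constraints: list = field(default_factory=list)
    y_constraints: list = field(default_factory=list)
    xy_constraints: list = field(default_factory=list)

# The passive-voice rule of Fig. 3, transcribed:
rule = TransferRule(
    sl_type="VP", tl_type="VP",
    sl_rhs=["V", "V", "Aux"],
    tl_rhs=["Aux", '"being"', "V"],
    alignments=[(1, 3)],
    x_constraints=["(x1 form) = part", "(x1 aspect) = perf",
                   "(x2 form) = part", "(x2 aspect) = imperf",
                   "(x2 lexwx) = jAnA", "(x3 lexwx) = honA",
                   "(x3 tense) = pres"],
    y_constraints=["(y1 lex) = be", "(y1 tense) = pres",
                   "(y3 form) = part"],
    xy_constraints=["(x0 tense) = (x3 tense)"],
)
print(rule.sl_rhs, "->", rule.tl_rhs, rule.alignments)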

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters are still being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial 'guesses' at transfer rules. The rules that result from Seed Generation are 'flat', in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be re-written more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
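As a rough illustration of the first phase, the sketch below shows the flat part of Seed Generation: a POS-tagged, word-aligned sentence pair becomes a flat transfer rule. Compositionality learning and Seeded Version Space Learning are omitted, and the input format is assumed for illustration.

# Sketch of Seed Generation: a word-aligned, POS-tagged pair of phrases
# is turned into a flat transfer rule listing the two POS sequences and
# their alignments. Later phases add phrasal structure and unification
# constraints; both are omitted here.

def seed_rule(sl_tagged, tl_tagged, alignments, rule_type="S"):
    """sl_tagged/tl_tagged: [(word, POS)]; alignments: 1-based pairs."""
    return {"type": rule_type,
            "sl_rhs": [pos for _, pos in sl_tagged],
            "tl_rhs": [pos for _, pos in tl_tagged],
            "alignments": alignments}

# Toy Hindi NP "eka aXyAya" ("one chapter"), aligned monotonically:
print(seed_rule([("eka", "NUM"), ("aXyAya", "N")],
                [("one", "NUM"), ("chapter", "N")],
                [(1, 1), (2, 2)], rule_type="NP"))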

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x side of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining, but non-overlapping, translation units that maximizes the probability p(e|f), where f = f_1 \ldots f_J is the source sentence and e = e_1 \ldots e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for:

\hat{e} = \arg\max_e p(e|f) = \arg\max_e p(f|e) \, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_{s} p(\tilde{f} \mid \tilde{e}) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i})    (4)
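Equations (3) and (4) translate directly into code. The sketch below assumes the IBM-1 lexicon is available as a nested mapping lexicon[f][e] = p(f|e), with a small smoothing floor for unseen pairs; both assumptions are illustrative.

# Sketch of the translation model of Eqs. (3)-(4): the probability of a
# source phrase given a target phrase is a product over source words of
# the sum of IBM-1 lexicon probabilities; the sentence-level model
# multiplies this over the chosen sequence of translation units.

def phrase_prob(f_phrase, e_phrase, lexicon):
    """Eq. (3): p(f~|e~) = prod_j sum_i p(f_j | e_i)."""
    prob = 1.0
    for f in f_phrase:
        prob *= sum(lexicon.get(f, {}).get(e, 1e-10) for e in e_phrase)
    return prob

def translation_model_prob(segmentation, lexicon):
    """Eq. (4): product of phrase probabilities over a sequence of
    non-overlapping translation units (f_phrase, e_phrase)."""
    prob = 1.0
    for f_phrase, e_phrase in segmentation:
        prob *= phrase_prob(f_phrase, e_phrase, lexicon)
    return prob

# Toy lexicon holding p(f|e) for two word pairs:
lexicon = {"patr": {"letters": 0.8}, "dAk": {"mail": 0.7}}
print(translation_model_prob([(["patr"], ["letters"]),
                              (["dAk"], ["mail"])], lexicon))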

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence, i.e., leaving a gap, translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution, i.e., longer movements are less likely than shorter ones, with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring, compared to the best scoring hypothesis up to that point, are pruned from further consideration.

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to insure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way, we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data was translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural, even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher          Xfer
past participle      (tam = yA)       (aspect = perf) (form = part)
present participle   (tam = wA)       (aspect = imperf) (form = part)
infinitive           (tam = nA)       (form = inf)
future               (tam = future)   (tense = fut)
subjunctive          (tam = subj)     (tense = subj)
root                 (tam = 0)        (form = root)

Table I: Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
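The feature conversion of Table I amounts to a lookup table. A minimal sketch, with the table transcribed as a dict; the function name and output format are illustrative, not the actual wrapper interface.

# Sketch of the Morpher-to-Xfer feature conversion of Table I: the
# Hindi tense/aspect/mood marker (tam) returned by IIIT Morpher is
# mapped onto the feature names used by the Xfer grammar formalism.
TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},  # past participle
    "wA":     {"aspect": "imperf", "form": "part"},  # present participle
    "nA":     {"form": "inf"},                       # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def convert_morpher_output(root: str, morpher_feats: dict) -> dict:
    """Turn a raw Morpher analysis into an Xfer feature structure."""
    xfer = {"root": root}
    xfer.update(TAM_TO_XFER.get(morpher_feats.get("tam"), {}))
    # Other features (gender, number, ...) would be carried over here.
    return xfer

print(convert_morpher_output("raha", {"tam": "yA"}))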

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequence rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5: Recursive NP Rules

Hindi NP with left recursion:
  (jIvana (life) ke (of) eka (one) aXyAya (chapter))
  [np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion:
  (one chapter of life)
  [np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6: Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7: Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
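One plausible reading of this pruning step is sketched below: each rule's probability is its training-example count normalized over rules of the same constituent type, and rules below a threshold are dropped. The normalization scheme and threshold are assumptions for illustration, not the exact procedure used.

# Sketch of frequency-based rule pruning: assign each learned rule a
# probability from its training-example count and drop low-probability
# rules, which tend to fire in contexts where they should not apply.
from collections import defaultdict

def prune_rules(rules, threshold=0.05):
    """rules: list of dicts with 'type' (e.g. 'NP') and 'count'."""
    totals = defaultdict(int)
    for r in rules:
        totals[r["type"]] += r["count"]
    kept = []
    for r in rules:
        r["prob"] = r["count"] / totals[r["type"]]
        if r["prob"] >= threshold:
            kept.append(r)
    return kept

rules = [{"type": "NP", "count": 40}, {"type": "NP", "count": 1},
         {"type": "PP", "count": 12}]
print([(r["type"], round(r["prob"], 3)) for r in prune_rules(rules)])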

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word, on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we 'enhanced' the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense forms. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs.
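The run-time interplay between Morpher output and the enhanced lexicon can be sketched as a root lookup followed by unification filtering. The feature-structure representation and the toy entries below are illustrative.

# Sketch of run-time lexical lookup: the Morpher features of the input
# word are unified with the constraints on each candidate lexical rule;
# entries with clashing features (e.g. singular vs. plural) are pruned.

def unify(features: dict, constraints: dict):
    """Return merged features, or None on a feature clash."""
    merged = dict(features)
    for key, val in constraints.items():
        if key in merged and merged[key] != val:
            return None
        merged[key] = val
    return merged

def lookup(root: str, features: dict, lexicon: list) -> list:
    """Keep lexical rules for `root` whose constraints unify."""
    return [e for e in lexicon
            if e["hindi_root"] == root
            and unify(features, e["constraints"]) is not None]

lexicon = [
    {"hindi_root": "kiwAba", "english": "book",  "constraints": {"num": "s"}},
    {"hindi_root": "kiwAba", "english": "books", "constraints": {"num": "pl"}},
]
# A plural Hindi noun selects only the plural English entry:
print(lookup("kiwAba", {"num": "pl"}, lexicon))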

With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert, in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.
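Extracting the most frequent bigrams from monolingual text is a one-pass count. A minimal sketch (the toy corpus is illustrative):

# Sketch of the frequent-bigram extraction: count adjacent word pairs
# in monolingual Hindi text and keep the most frequent ones for manual
# translation.
from collections import Counter

def top_bigrams(sentences, n=500):
    counts = Counter()
    for sent in sentences:
        words = sent.split()
        counts.update(zip(words, words[1:]))
    return [bg for bg, _ in counts.most_common(n)]

corpus = ["eka aXyAya", "eka aXyAya likhA", "jIvana ke eka aXyAya"]
print(top_bigrams(corpus, n=3))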

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details, see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2. Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations, without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
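The three passes can be summarized in a single sketch. Here, analyze and apply_grammar stand in for the morphology server and the transfer engine, neither of which is reproduced; lattice entries are simplified to (start, end, translation) tuples.

# Sketch of the three-pass lattice construction: (1) phrasal lexicon
# matches on surface forms, no grammar; (2) morphological analysis
# feeding the grammar rules; (3) word-to-word dictionary matches on
# the original inflected forms.

def build_lattice(sentence, phrasal_lexicon, word_lexicon,
                  analyze, apply_grammar):
    lattice = []
    words = sentence.split()
    # Pass 1: phrasal entries (more than one Hindi word) on full forms.
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):
            phrase = " ".join(words[i:j])
            for tl in phrasal_lexicon.get(phrase, []):
                lattice.append((i, j, tl))
    # Pass 2: morphology, then grammar rules over the root forms.
    analyses = [analyze(w) for w in words]   # (root, features) pairs
    lattice.extend(apply_grammar(analyses))
    # Pass 3: word-to-word matches on the original inflected forms.
    for i, w in enumerate(words):
        for tl in word_lexicon.get(w, []):
            lattice.append((i, i + 1, tl))
    return lattice

# Toy usage with trivial stand-ins for morphology and grammar:
print(build_lattice("patr bheje", {}, {"patr": ["letters"]},
                    lambda w: (w, {}), lambda analyses: []))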

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group, as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a test sentence and reference translations

System                 BLEU                NIST
EBMT                   0.058               4.22
SMT                    0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar        0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar   0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar    0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer + SMT)      0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II. System performance results for the various translation approaches

[Figure 9 plots NIST scores (roughly 4.4 to 6.0) against decoder reordering windows of 0 to 4 for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.]

Fig. 9. Results by NIST score

[Figure 10 plots BLEU scores (roughly 0.07 to 0.15) against decoder reordering windows of 0 to 4 for the same five systems.]

Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered and allowed to be moved to the current position in the output, if this improves the overall score of the resulting output string according to the English language model.
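The reordering behavior can be illustrated with a simplified sketch. The real decoder performs a full search over the lattice; the greedy procedure and the language-model interface below are assumptions made for brevity.

def decode(arcs, n_words, lm_logprob, window=4):
    """arcs: list of (start, end, english) tuples covering [start, end).
    Greedy illustration; the actual decoder explores many hypotheses."""
    covered = set()
    position = 0
    output = []
    while position < n_words:
        # Candidate arcs start at the current position or up to `window`
        # words ahead (the latter get reordered to the front).
        candidates = [a for a in arcs
                      if position <= a[0] <= position + window
                      and not any(i in covered for i in range(a[0], a[1]))]
        if not candidates:
            position += 1  # skip an untranslatable word
            continue
        # Pick the arc whose English side scores best under the language
        # model, given the output produced so far.
        best = max(candidates,
                   key=lambda a: lm_logprob(" ".join(output + [a[2]])))
        output.append(best[2])
        covered.update(range(best[0], best[1]))
        while position in covered:
            position += 1
    return " ".join(output)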

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and by the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the resulting set of scores.
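The resampling procedure is straightforward to sketch. The scoring function below stands in for BLEU or NIST; everything else follows the description above.

import random
import statistics

def bootstrap_ci(hypotheses, references, score, n_draws=10000):
    n = len(hypotheses)  # 258 test sentences in our experiment
    samples = []
    for _ in range(n_draws):
        # Draw n sentence indices with replacement.
        idx = [random.randrange(n) for _ in range(n)]
        samples.append(score([hypotheses[i] for i in idx],
                             [references[i] for i in idx]))
    samples.sort()
    median = statistics.median(samples)
    lo, hi = samples[int(0.025 * n_draws)], samples[int(0.975 * n_draws)]
    return median, (lo, hi)  # median score and 95% confidence interval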

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the performance of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor; a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences for the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is not statistically very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6. CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html

The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi

Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm

The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html

Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.

Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.

Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.

Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.

Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).

Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.

Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language resource. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).

Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.

Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.

Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).

Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.

Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.

Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).

Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).

Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.

Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for resource-scarce languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphological analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphological analyzers, and one or more machine translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together results in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the four languages contrasted in this paper (Hindi, Hebrew, Mapudungun, and Quechua), we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese, other languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People who participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

The research on Hebrew reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James. (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramática Quechua, Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna, and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert, and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina, and Alon Lavie. (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 Workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley, and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly, and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


In the current working MT prototype, the Suffix Lexicon has 36 entries; Cusihuaman's grammar (2001) lists a total of 150 suffixes.

2.3.2.2 Translation Rules

The translation grammar, written with comprehensive rules following the same formalism described in subsection 2.2.2.3.1 above, contains 25 rules and covers subject-verb agreement, agreement within the NP (Det-N and N-Adj), intransitive VPs, copula verbs, verbal suffixes, nominal suffixes, and enclitics. Figure 14 shows a couple of examples of rules in the translation grammar, and Figure 15 shows a few correct translations output by the Quechua-Spanish MT system. The MT output is the result of inflecting the Spanish citation forms using the Morphological Generator discussed earlier.

2.3.3 Preliminary User Studies

A preliminary user study of the correction of Quechua-to-Spanish translations was conducted, in which three Quechua speakers with good knowledge of Spanish evaluated and corrected nine machine translations where necessary. The correction of MT output is the first step towards automatic rule refinement, and was done through a user-friendly interface called the Translation Correction Tool (Font Llitjós and Carbonell, 2004). This small user study allowed us to see how Quechua speakers used the online tool and whether they had any problems with the interface. It showed that the Quechua representation of stems and suffixes as separate words does not seem to pose a problem, and that the tool was relatively easy to use for non-technical users.

Figure 14: Manually written grammar rules for Quechua-Spanish translation

{S,2}
S::S : [NP VP] -> [NP VP]
(
 (X1::Y1) (X2::Y2)
 ((x0 type) = (x2 type))
 ((y1 number) = (x1 number))
 ((y1 person) = (x1 person))
 ((y1 case) = nom)
 ;; subj-v agreement
 ((y2 number) = (y1 number))
 ((y2 person) = (y1 person))
 ;; subj-embedded Adj agreement
 ((y2 PredAdj number) = (y1 number))
 ((y2 PredAdj gender) = (y1 gender))
)

{SBar,1}
SBar::SBar : [S] -> ["Dice que" S]
(
 (X1::Y2)
 ((x1 type) =c reportative)
)

{VBar,4}
VBar::VBar : [V VSuff VSuff] -> [V]
(
 (X1::Y1)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) = (NOT ger))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
 ((x0 tense) = (x2 tense))
 ((y1 tense) = (x2 tense))
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 mood) = (x3 mood))
)

sl: taki sha ra ni (I was singing)
tl: ESTUVE CANTANDO
tree: <((S1 (VP0 (VBAR5 (V00 ESTUVE) (V21 CANTANDO)))))>

sl: taki ra n si (it is said that she sang)
tl: DICE QUE CANTÓ
tree: <((SBAR1 (LITERAL DICE QUE) (S1 (VP0 (VBAR1 (VBAR4 (V21 CANTÓ)))))))>

sl: noqa qa barcelona manta ka ni (I am from Barcelona)
tl: YO SOY DE BARCELONA
tree: <((S2 (NP6 (NP1 (PRONBAR1 (PRON01 YO)))) (VP3 (VBAR2 (V35 SOY)) (NP5 (NSUFF14 DE) (NP2 (NBAR1 (N23 BARCELONA)))))))>


Figure 15: MT output from the Quechua-Spanish rule-based MT prototype

2.4 Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants and patterns are sequences of vowels, and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in its early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence, the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous, both at the lexical and morphological levels. The Hebrew script[3], not unlike that of Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"); prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"); subordinating conjunctions such as $ ("that") and K$ ("when"); relativizers such as $ ("that"); and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus, a form such as MHGR can be read as a lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or a prefixed particle.

[3] To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.
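The following toy sketch illustrates this segmentation ambiguity. The particle inventory follows the text; the enumeration strategy and the absence of a lexicon filter (which would discard non-words) are simplifying assumptions.

PARTICLES = ["H", "B", "K", "L", "M", "$", "K$", "W"]

def segmentations(form, prefix=()):
    """Yield tuples of attached particles followed by the remaining stem."""
    yield prefix + (form,)  # read the whole remainder as a stem
    for p in PARTICLES:
        if form.startswith(p) and len(form) > len(p):
            yield from segmentations(form[len(p):], prefix + (p,))

# For MHGR this yields, among others:
#   ('MHGR',)         -> the lexeme "immigrant"
#   ('M', 'HGR')      -> "from Hagar"
#   ('M', 'H', 'GR')  -> "from the foreigner"
for seg in segmentations("MHGR"):
    print("-".join(seg))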

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua posed similar challenges of orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule-based, relying on the AVENUE run-time transfer engine described above to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period with a total labor effort equivalent to about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. The small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems we used existing, publicly available resources, which we adapted in novel ways for the MT task, to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source-language morphological analysis and target-language morphological generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next subsection) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
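A minimal sketch of this mapping step is shown below. The assumption that the input is in code page cp1255 (Windows Hebrew) and the particular transliteration table are illustrative; only a few letters are listed.

ROMANIZE = {
    "\u05d0": "A",   # aleph
    "\u05d1": "B",   # bet
    "\u05d4": "H",   # he
    "\u05d5": "W",   # vav
    "\u05de": "M",   # mem (the final form would also map to M)
    "\u05e8": "R",   # resh
    "\u05e9": "$",   # shin, written $ in the paper's transliteration
    "\u05ea": "T",   # tav
}

def preprocess(raw_bytes: bytes) -> str:
    text = raw_bytes.decode("cp1255")  # Microsoft Windows Hebrew encoding
    # Map each Hebrew letter to its romanized equivalent; pass through
    # punctuation, digits, and whitespace unchanged.
    return "".join(ROMANIZE.get(ch, ch) for ch in text)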

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we use a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features, such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the Hebrew analyzer is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which will all be stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice representation of the set of analyses for the Hebrew word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-structure representation of the set of analyses for the Hebrew word B$WRH
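In code, the lattice contents of Figures 2 and 3 might be represented along the following lines; the class and field names are illustrative, not the engine's actual types.

from dataclasses import dataclass, field

@dataclass
class Lexeme:
    spanstart: int    # first input position covered
    spanend: int      # position after the last one covered
    lex: str          # the lexeme in romanized transliteration
    pos: str          # part of speech
    features: dict = field(default_factory=dict)  # GEN, NUM, STATUS, ...

# The four analyses of B$WRH as paths over shared arcs:
bswrh_paths = [
    [Lexeme(0, 4, "B$WRH", "N", {"GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"})],
    [Lexeme(0, 2, "B", "PREP"),
     Lexeme(2, 4, "$WRH", "N", {"GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"})],
    [Lexeme(0, 1, "B", "PREP"), Lexeme(1, 2, "H", "DET"),
     Lexeme(2, 4, "$WRH", "N", {"GEN": "F", "NUM": "S", "STATUS": "ABSOLUTE"})],
    [Lexeme(0, 1, "B", "PREP"),
     Lexeme(1, 3, "$WR", "N", {"GEN": "M", "NUM": "S", "STATUS": "ABSOLUTE"}),
     Lexeme(3, 4, "$LH", "POSS")],
]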

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice, rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed from the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.
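The coverage-extension step can be sketched as a simple dictionary inversion; the flat string-to-strings representation below is an assumption, as real entries carry part-of-speech and other annotations.

from collections import defaultdict

def build_translation_lexicon(heb_to_eng, eng_to_heb):
    lexicon = defaultdict(set)
    # Take the Hebrew-to-English section as-is.
    for heb, eng_list in heb_to_eng.items():
        lexicon[heb].update(eng_list)
    # Invert the English-to-Hebrew section: every pair (eng -> heb)
    # also licenses the translation (heb -> eng).
    for eng, heb_list in eng_to_heb.items():
        for heb in heb_list:
            lexicon[heb].add(eng)
    return {heb: sorted(engs) for heb, engs in lexicon.items()}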

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary would improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) which require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

Unlike our MT systems that translate into Spanish, which inflect Spanish citation forms with a morphological generator, the transfer system currently does not use a morphological generator on the target (English) side. Instead, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these forms should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms with English inflected forms. At runtime, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
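A sketch of this enhancement process is given below; the inflect() helper and the constraint representation are stand-ins for the actual lexical transfer rule format.

def enhance_entry(hebrew, english, pos, inflect):
    """Expand one base lexicon entry into inflected English variants,
    each carrying the feature constraint under which it may be used."""
    entries = [(hebrew, english, pos, {})]  # the original base form
    if pos == "N":
        entries.append((hebrew, inflect(english, "plural"), pos,
                        {"num": "pl"}))
    elif pos == "V":
        for form, constraint in [("past", {"tense": "past"}),
                                 ("future", {"tense": "fut"}),
                                 ("pres3s", {"tense": "pres", "person": "3",
                                             "num": "sg"}),
                                 ("infinitive", {"form": "inf"})]:
            entries.append((hebrew, inflect(english, form), pos, constraint))
    return entries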

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}
;; SL: $MLH ADWMH
;; TL: A RED DRESS
;; Score: 2
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
 (X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1)
)

{NP1,3}
;; SL: H $MLWT H ADWMWT
;; TL: THE RED DRESSES
;; Score: 4
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
 (X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1)
)

Figure 4: NP transfer rules for nouns modified by adjectives, from Hebrew to English

The transfer engine was designed to support both manually developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results obtained using this automatically acquired grammar are reported below in the Results and Evaluation subsection.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate below, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team, and who is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages, and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5: Select translated sentences from the development data


System            BLEU                      NIST                       Precision  Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]    0.3830     0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]    0.3938     0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]    0.4085     0.4241

Table 1: System performance results with the various grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, the translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also report aggregate unigram precision and unigram recall as additional measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we applied a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores of the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.


2.11 Hindi

Little natural language processing work had been done on Hindi by 2003, when DARPA designated Hindi as the language of its Surprise Language Exercise, a task designed to test how quickly MT systems can be built for a language with few resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than that with the current automatically inferred rules. Furthermore, a multi-engine version of our system that combines the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

1. INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,


and have significantly improved the state of the art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for the development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE. Rapid-development MT for low-resource languages has also been explored by others [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than with the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of the training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig 1 Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP : [V V Aux] -> [Aux being V]
((X1::Y3)
((x1 form) = part)
((x1 aspect) = perf)
((x2 form) = part)
((x2 aspect) = imperf)
((x2 lexwx) = jAnA)
((x3 lexwx) = honA)
((x3 tense) = pres)
((x0 tense) = (x3 tense))
(x0 = x1)
((y1 lex) = be)
((y1 tense) = pres)
((y3 form) = part))

Fig 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.
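To make the six components above concrete, the following is a minimal sketch of one possible in-memory representation, written in Python purely for illustration; the class and field names are our assumptions, not the actual data structures of the Xfer system.

# Illustrative sketch only (not the actual Xfer implementation): a minimal
# container for the six components of a transfer rule described above.
from dataclasses import dataclass, field

@dataclass
class TransferRule:
    sl_type: str        # type information for the SL side, e.g. "VP"
    tl_type: str        # TL type information; may differ from the SL type
    sl_rhs: list        # SL component sequence, e.g. ["V", "V", "Aux"]
    tl_rhs: list        # TL component sequence, e.g. ["Aux", "being", "V"]
    alignments: list = field(default_factory=list)     # (x, y) index pairs, e.g. [(1, 3)]
    x_constraints: list = field(default_factory=list)  # e.g. [(("x1", "form"), "part")]
    y_constraints: list = field(default_factory=list)  # e.g. [(("y1", "lex"), "be")]
    xy_constraints: list = field(default_factory=list) # feature values copied SL -> TL

# The passive-voice rule of Figure 3, encoded in this hypothetical form:
passive_present = TransferRule(
    sl_type="VP", tl_type="VP",
    sl_rhs=["V", "V", "Aux"], tl_rhs=["Aux", "being", "V"],
    alignments=[(1, 3)],
    x_constraints=[(("x1", "form"), "part"), (("x1", "aspect"), "perf"),
                   (("x2", "form"), "part"), (("x2", "aspect"), "imperf")],
    y_constraints=[(("y1", "lex"), "be"), (("y1", "tense"), "pres"),
                   (("y3", "form"), "part")],
)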

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr DAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat", in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be re-written more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free; they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules, by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
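The sketch below illustrates the flavor of the first two learning stages under strong simplifying assumptions: one flat rule per training pair, and no re-indexing of alignments after composition. The function names and rule encoding are hypothetical, not the system's actual interfaces; see [Probst et al. 2003] for the real algorithms.

# Sketch of Seed Generation: one flat rule per word-aligned training pair
# (POS tags are taken as given; alignments are 1-based (x, y) index pairs).
def seed_rule(sl_tagged, tl_tagged, alignments, rule_type="S"):
    """sl_tagged / tl_tagged: lists of (word, POS) pairs."""
    return {
        "type": rule_type,
        "sl_rhs": [pos for _, pos in sl_tagged],   # flat POS sequence, no phrasal nodes
        "tl_rhs": [pos for _, pos in tl_tagged],
        "alignments": list(alignments),
    }

# Sketch of Compositionality Learning: if a previously learned NP rule covers
# a sub-span of the flat SL sequence, rewrite that span as a single NP node
# (e.g. Det Adj N P -> NP P). Alignment re-indexing is omitted for brevity.
def compose(flat_rule, np_rhs_patterns):
    rhs = list(flat_rule["sl_rhs"])
    for pattern in np_rhs_patterns:            # e.g. ["Det", "Adj", "N"]
        n = len(pattern)
        for i in range(len(rhs) - n + 1):
            if rhs[i:i + n] == pattern:
                rhs = rhs[:i] + ["NP"] + rhs[i + n:]
                break
    return {**flat_rule, "sl_rhs": rhs}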

2.3 The Runtime Transfer System

At run-time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The run-time translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system. In the first stage, parsing is performed based solely on the x side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the run-time transfer-based translation sub-system can be found in [Peterson 2002].

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for

\hat{e} = \arg\max_e p(e|f) = \arg\max_e p(f|e)\, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_j \sum_i p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase, and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_s p(\tilde{f} \mid \tilde{e}) = \prod_s \prod_j \sum_i p(f_{s_j} \mid e_{s_i})    (4)
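Read concretely, equations (3) and (4) are a product over source words of sums over target words, multiplied across the chosen translation units. A minimal Python sketch, assuming a lexicon dictionary lex that maps (f, e) word pairs to the trained IBM-1 probabilities p(f|e) (the dictionary format is our assumption for illustration):

import math

def unit_prob(f_words, e_words, lex):
    """Equation (3): probability of one translation unit under the lexicon."""
    p = 1.0
    for f in f_words:                                       # product over source words
        p *= sum(lex.get((f, e), 0.0) for e in e_words)     # sum over target words
    return p

def sequence_prob(units, lex):
    """Equation (4): product over all units (f_words, e_words) in the sequence s."""
    return math.prod(unit_prob(f, e, lex) for f, e in units)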

The search algorithm considers all possible sequences s in the lattice, and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring, compared to the best scoring hypothesis up to that point, are pruned from further consideration.
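The sketch below shows a monotone variant of such a lattice search with beam pruning. For simplicity, the trigram model is applied to complete hypotheses rather than incrementally, and the Gaussian-scored reordering of the actual decoder is omitted; the lattice encoding (start index, end index, English words, translation model probability) and all names are illustrative assumptions, not the real decoder's interface.

from collections import defaultdict

def trigram_lm(e_words, lm):
    """Stub trigram model p(e); lm maps (w1, w2, w3) -> probability."""
    p, padded = 1.0, ["<s>", "<s>"] + e_words
    for i in range(2, len(padded)):
        p *= lm.get(tuple(padded[i - 2:i + 1]), 1e-7)   # crude floor for unseen trigrams
    return p

def decode(lattice, lm, src_len, beam=50):
    """lattice: list of (start, end, tl_words, tm_prob) translation units.
    Monotone beam search over non-overlapping, fully covering unit sequences."""
    by_start = defaultdict(list)
    for unit in lattice:
        by_start[unit[0]].append(unit)
    hyps = {0: [([], 1.0)]}                 # source position -> [(english, tm_prob)]
    for pos in range(src_len):
        # keep only the `beam` best partial hypotheses ending at this position
        for eng, tm in sorted(hyps.get(pos, []), key=lambda h: -h[1])[:beam]:
            for start, end, words, p in by_start[pos]:
                hyps.setdefault(end, []).append((eng + words, tm * p))
    finals = hyps.get(src_len, [])
    # rescore complete hypotheses with the language model, as in equation (1)
    return max(finals, key=lambda h: h[1] * trigram_lm(h[0], lm), default=None)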

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to insure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural, even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the run-time translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.
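The extraction and compositional ordering of Treebank phrases described above can be sketched as follows, here using NLTK tree objects purely for illustration (the paper does not describe the actual extraction scripts at this level of detail, and the depth measure below is only one plausible notion of phrase complexity):

# Sketch: extract NPs and PPs from parsed sentences and order them by
# parse-tree depth, so that simpler phrases precede more complex ones.
from nltk import Tree

def extract_phrases(parsed_corpus, labels=("NP", "PP")):
    phrases = []
    for tree in parsed_corpus:
        for sub in tree.subtrees(lambda t: t.label() in labels):
            phrases.append((sub.height(), " ".join(sub.leaves())))
    phrases.sort(key=lambda p: p[0])          # shallow (simple) phrases first
    return [text for _, text in phrases]

example = Tree.fromstring(
    "(S (NP (NP (DT a) (NN chapter)) (PP (IN of) (NP (NN life)))))")
# Prints the simple NPs before the PP and the deeper, PP-containing NP:
# ['a chapter', 'life', 'of life', 'a chapter of life']
print(extract_phrases([example]))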

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

Description          Morpher          Xfer
past participle      (tam = yA)       (aspect = perf) (form = part)
present participle   (tam = wA)       (aspect = imperf) (form = part)
infinitive           (tam = nA)       (form = inf)
future               (tam = future)   (tense = fut)
subjunctive          (tam = subj)     (tense = subj)
root                 (tam = 0)        (form = root)

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the run-time Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
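The Table I mapping is simple enough to state directly in code. The dictionary below restates the table verbatim; the wrapper function and the feature-dictionary format are assumptions for illustration, not the actual interface of our Morpher wrapper.

# Mapping of Morpher tense/aspect/mood (tam) values to Xfer features,
# transcribed directly from Table I.
TAM_MAP = {
    "yA":     {"aspect": "perf",   "form": "part"},   # past participle
    "wA":     {"aspect": "imperf", "form": "part"},   # present participle
    "nA":     {"form": "inf"},                        # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def convert_morpher_features(morpher_feats):
    """Translate a hypothetical Morpher feature dict, e.g. {"tam": "yA", ...},
    into the Xfer grammar's feature system; other features pass through."""
    out = {k: v for k, v in morpher_feats.items() if k != "tam"}
    out.update(TAM_MAP.get(str(morpher_feats.get("tam")), {}))
    return out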

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed, and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequence rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig 5. Recursive NP Rules

Hindi NP with left recursion: (jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion: (one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig 6. Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
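A minimal sketch of this frequency-based pruning, assuming each learned rule carries a count of the training examples that produced it; the probability threshold that reduced 327 rules to 16 is not stated above, so the value below is purely illustrative.

# Sketch: convert rule frequencies to probabilities and drop low-probability
# rules, which tend to fire in contexts where they should not apply.
def prune_rules(rule_counts, threshold=0.01):
    """rule_counts: dict mapping a (hashable) rule to its training frequency.
    Returns the surviving rules with their relative-frequency probabilities."""
    total = sum(rule_counts.values())
    return {r: c / total for r, c in rule_counts.items() if c / total >= threshold}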

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word, on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were "best bets" for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary, and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule, indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for plural, etc. (a sketch of such rules appears at the end of this section). For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
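The spelling-rule-based dictionary enhancement described earlier in this section can be sketched as follows; the entry format and the exact rule set are illustrative assumptions, not the actual rules used.

# Sketch: generate plural English forms by spelling rule, and add an entry
# carrying a plural-number constraint for each noun in the lexicon.
def pluralize(noun):
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                     # e.g. box -> boxes
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"               # e.g. city -> cities
    return noun + "s"                          # default: letter -> letters

def enhance_noun_entries(lexicon):
    """lexicon: list of dicts like {"hi": ..., "en": ..., "pos": "N"}.
    Returns the lexicon extended with plural entries; the Hindi side is
    unchanged, since Hindi input is reduced to root forms by Morpher."""
    extra = [{**e, "en": pluralize(e["en"]), "constraints": {"num": "pl"}}
             for e in lexicon if e.get("pos") == "N"]
    return lexicon + extra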

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans, and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details, see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module, to obtain the root forms along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2. Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations, without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
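Schematically, the three passes can be expressed as below. The three helper callables stand in for the system's lexicon matcher, morphological analyzer, and grammar application; they are assumptions for illustration, not the transfer engine's actual API.

# Sketch of the three-pass lattice construction described above.
def build_lattice(hindi_words, match_lexicon, analyze, apply_grammar):
    lattice = []
    # Pass 1: full-form (phrasal) lexicon matches, before any morphology.
    lattice += match_lexicon(hindi_words, phrasal=True)
    # Pass 2: morphological root forms and features, fed through the grammar.
    analyzed = [analyze(w) for w in hindi_words]   # (root, features) per word
    lattice += apply_grammar(analyzed)
    # Pass 3: word-to-word matches of the original inflected forms.
    lattice += match_lexicon(hindi_words, phrasal=False)
    return lattice   # a bigger lattice; the decoder rescores all entries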

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig 8. Example of a Test Sentence and Reference Translations

Table II. System Performance Results for the Various Translation Approaches

System                  BLEU                 NIST
EBMT                    0.058                4.22
SMT                     0.102 (+/- 0.016)    4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)    5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)    5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)    5.59 (+/- 0.20)
MEMT (Xfer + SMT)       0.136 (+/- 0.018)    5.65 (+/- 0.21)

[Figure: NIST scores as a function of the decoder reordering window (0-4), for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.]

Fig 9. Results by NIST score

[Figure: BLEU scores as a function of the decoder reordering window (0-4), for the same five systems.]

Fig 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output, if this improves the overall score of the resulting output string, according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
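This resampling procedure is a standard bootstrap. A compact sketch, assuming a corpus-level score_fn supplied by the caller (e.g., BLEU or NIST computed over a list of hypothesis/reference items; the data format is an assumption for illustration):

# Sketch: bootstrap median and 95% confidence interval over a test set.
import random

def bootstrap_ci(test_set, score_fn, draws=10000, seed=0):
    rng = random.Random(seed)
    n = len(test_set)
    # draw n items with replacement, rescore, and repeat `draws` times
    scores = sorted(score_fn([test_set[rng.randrange(n)] for _ in range(n)])
                    for _ in range(draws))
    return (scores[draws // 2],           # median score
            scores[int(0.025 * draws)],   # lower bound of the 95% interval
            scores[int(0.975 * draws)])   # upper bound of the 95% interval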

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different reordering windows in the decoder on the different systems. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios, where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology, token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor; a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice, and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated, rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India that conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16(2), 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19(2), 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28(1), 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3 Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool, to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translation produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the Hindi, Hebrew, Mapudungun, and Quechua systems contrasted in this paper, we have also worked on Dutch, and we are beginning work on Inupiaq, Aymara, Brazilian Portuguese, and other languages native to Brazil.

4 Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5 References

Allen, James. (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramática Quechua Cuzco Callao, 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna, and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert, and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pages 1-10.

Lavie Alon Stephan Vogel Lori Levin Erik Peterson Katharina Probst Ariadna Font Llitjoacutes Rachel Reynolds Jaime Carbonell and Richard Cohen (2003) Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario ACM Transactions on Asian Language Information Process-ing (TALIP) 2(2)

Levin Lori Alison Alvarez Jeff Good and Robert Fred-erking (In Press) Automatic Learning of Grammatical Encoding To appear in Jane Grimshaw Joan Maling Chris Manning Joan Simpson and Annie Zaenen (eds) Architectures Rules and Preferences A Festschrift for Joan Bresnan CSLI Publications

Levin Lori Rodolfo Vega Jaime Carbonell Ralf Brown Alon Lavie Eliseo Cantildeulef and Carolina Huenchullan (2000) Data Collection and Language Technologies for Mapudungun International Conference on Language Resources and Evaluation (LREC)

Mitchell Marcus A Taylor R MacIntyre A Bies C Cooper M Ferguson and A Littmann (1992) The Penn Treebank Project httpwwwcisupennedu tree-bankhomehtml

Monson Christian Lori Levin Rodolfo Vega Ralf Brown Ariadna Font Llitjoacutes Alon Lavie Jaime Car-bonell Eliseo Cantildeulef and Rosendo Huesca (2004) Data Collection and Analysis of Mapudungun Morphol-ogy for Spelling Correction International Conference on Language Resources and Evaluation (LREC)

Papineni Kishore Salim Roukos Todd Ward and Wei-Jing Zhu (2001) BLEU A Method for Automatic Evaluation of Machine Translation IBM Research Re-port RC22176 (W0109-022)

Peterson Erik (2002) Adapting a transfer engine for rapid machine translation development MS thesis Georgetown University

Probst Katharina (2005) Automatically Induced Syntac-tic Transfer Rules for Machine Translation under a Very Limited Data Scenario PhD Thesis Carnegie Mellon

Probst Katharina and Alon Lavie (2004) A structurally diverse minimal corpus for eliciting structural mappings between languages Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04)

Probst Katharina Ralf Brown Jaime Carbonell Alon Lavie Lori Levin and Erik Peterson (2001) Design and Implementation of Controlled Elicitation for Ma-chine Translation of Low-density Languages Proceed-ings of the MT2010 workshop at MT Summit

References

Carbonell Jaime Katharina Probst Erik Peterson Christian Monson Alon Lavie Ralf Brown and

Lori Levin 2002 Automatic Rule Learning for Re-source-Limited MT In Proceedings of the

5th Biennial Conference of the Association for Ma-chine Translation in the Americas (AMTA-02)

OctoberDahan Hiya 1997 HebrewndashEnglish EnglishndashHebrew

Dictionary Academon JerusalemDoddington George 2002 Automatic Evaluation of

Machine Translation Quality Using N-gram Co-Occurrence Statistics In Proceedings of the Second

Conference on Human Language Technology(HLT-2002)Efron Bradley and Robert Tibshirani 1986 Bootstrap

Methods for Standard Errors Condence Intervalsand Other Measures of Statistical Accuracy Statistical

Science 154ndash77Lavie Alon Stephan Vogel Lori Levin Erik Peter-

son Katharina Probst Ariadna Font Llitjos RachelReynolds Jaime Carbonell and Richard Cohen 2003

Experiments with a Hindi-to-EnglishTransfer-based MT System under a Miserly Data Sce-

nario Transactions on Asian Language InformationProcessing (TALIP) 2(2) JunePapineni Kishore Salim Roukos ToddWard

andWei-Jing Zhu 2002 BLEU a Method for AutomaticEvaluation of Machine Translation In Proceedings of

40th Annual Meeting of the Association forComputational Linguistics (ACL) pages 311ndash318

Philadelphia PA JulyPeterson Erik 2002 Adapting a Transfer Engine for

Rapid Machine Translation Development Mastersthesis Georgetown UniversityProbst Katharina Lori Levin Erik Peterson Alon

Lavie and Jaime Carbonell 2002 MT for Resource-Poor Languages Using Elicitation-Based Learning of

Syntactic Transfer Rules Machine Translation17(4)

Segal Erel 1999 Hebrew Morphological Analyzer for Hebrew Undotted Texts Masters thesis

Technion Israel Institute of Technology Haifa Octo-ber In Hebrew

Wintner Shuly 2004 Hebrew Computational Lin-guistics Past and Future Articial Intelligence

Review 21(2)113ndash138Wintner Shuly and Shlomo Yona 2003 Resources

for Processing Hebrew In Proceedings of theMT-Summit IX workshop on Machine Translation for

Semitic Languages New Orleans September

Carnegie Mellon University 271206
Refereces from the Hebrew paper

2.4. Hebrew

The AVENUE project's omnivorous approach to machine translation development was conceived to aid the creation of MT systems for any language with scarce computational resources. As discussed in the previous two sections, many indigenous languages clearly belong to the category of resource-scarce languages. In addition to the indigenous South American languages Mapudungun and Quechua, the AVENUE project has successfully applied omnivorous MT creation strategies to a number of languages that pose similar challenges in computational resource availability. This section and the next describe our application of omnivorous MT system development to two such languages: Hebrew and Hindi.

Israeli Hebrew (also known as Modern Hebrew, henceforth Hebrew) is one of the two official languages of the State of Israel. It is spoken natively by about half of the population and fluently by virtually all of the over six million residents of the country. Hebrew exhibits clear Semitic behavior. In particular, its lexicon, word formation, and inflectional morphology are typically Semitic. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants, and patterns are sequences of vowels and sometimes also consonants, with "slots" into which the root's consonants are inserted. Inflectional morphology is highly productive and consists mostly of suffixes, but also prefixes and circumfixes.

Hebrew rates slightly higher than Mapudungun and Quechua on a scale measuring the availability of computational language resources. Still, the paucity of existing available resources for Hebrew is a challenge to building MT systems (Wintner, 2004). The collection of corpora for Hebrew is still in early stages (Wintner and Yona, 2003), and all existing significant corpora are monolingual. Hence the use of aligned bilingual corpora for MT purposes is currently not a viable option. There is no available large Hebrew language model which could help in disambiguation. Good morphological analyzers are proprietary, and publicly available ones are limited (Wintner, 2004). No publicly available bilingual dictionaries currently exist, and no grammar is available from which transfer rules can be extracted. Still, we made full use of existing resources, which we adapted and augmented to fit our needs.

Machine translation of Hebrew is also challenging because of ambiguity and incomplete standardization of the Hebrew orthography. Hebrew and its orthography are highly ambiguous at both the lexical and morphological levels. The Hebrew script3, not unlike that for Arabic, attaches several short particles to immediately following words. Among others, these particles include the definite article H ("the"); prepositions such as B ("in"), K ("as"), L ("to"), and M ("from"); subordinating conjunctions such as $ ("that") and K$ ("when"); relativizers such as $ ("that"); and the coordinating conjunction W ("and"). These attached prefix particles introduce ambiguity into the script, as the same character sequences can often also be parts of legitimate stems. Thus a form such as MHGR can be read as a lexeme "immigrant", as M-HGR "from Hagar", or even as M-H-GR "from the foreigner". Note that there is no deterministic way to tell whether the first M of the form is part of the pattern, the root, or a prefixing particle.

3 To facilitate readability, we use a transliteration of Hebrew using ASCII characters in this paper.
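The enumeration of such readings can be pictured with a small Python sketch; the particle inventory follows the list above, and the three-entry stem lexicon is a hypothetical stand-in for the analyzer's real lexicon:

PARTICLES = ["M", "H", "B", "K", "L", "W", "$", "K$"]
LEXICON = {"MHGR": "immigrant", "HGR": "Hagar", "GR": "foreigner"}

def segmentations(word, prefix=()):
    # Enumerate all ways to read `word` as zero or more particles plus a known stem.
    results = []
    if word in LEXICON:
        results.append(prefix + (word,))
    for p in PARTICLES:
        if word.startswith(p) and len(word) > len(p):
            results.extend(segmentations(word[len(p):], prefix + (p + "-",)))
    return results

for analysis in segmentations("MHGR"):
    print("".join(analysis))   # MHGR, M-HGR, M-H-GR

This only illustrates the size of the search space; the actual analyzer we use is described in Section 2.6.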

An added complexity arises from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud ("dots"), decorate the words, and another in which the dots are omitted, but where other characters represent some, but not all, of the vowels. Most of the modern printed and electronic texts in Hebrew use the "undotted" script. While a standard convention for this script officially exists, it is not strictly adhered to, even by the major newspapers and in government publications. Thus the same word can be written in more than one way, sometimes even within the same document. This fact adds significantly to the degree of ambiguity and requires creative solutions for practical Hebrew language processing applications. Note that Mapudungun and Quechua similarly posed challenges of orthographic standardization. Hebrew differs from these two indigenous South American languages, however, in that the orthography, as standard as it is, is inherently highly ambiguous.

The AVENUE project team has developed two preliminary MT systems to translate from Hebrew into English. Both systems are rule based, relying on the AVENUE run-time transfer engine described earlier in this paper to produce translations. The two constructed systems differ in the sets of rules the transfer engine interprets and in the manner in which these sets of rules were constructed. The first MT system was an experiment in rapid development of hand-written syntactic transfer systems. This hand-written system was developed over the course of a two-month period, with a total labor-effort equivalent of about four person-months. The manually written transfer rules reflect the most common local syntactic differences between Hebrew and English. This small set of rules turns out to be already sufficient for producing some legible translations of newspaper texts. To create the transfer rules in the second Hebrew-English MT system, we applied an automatic transfer-rule learning approach (Carbonell et al., 2002). For both of these systems we used existing publicly available resources, which we adapted in novel ways for the MT task to directly address the major issues of lexical, morphological, and orthographic ambiguity. We have evaluated translation performance results for both rule-based systems using state-of-the-art measures. Translation performance of these prototype MT systems is encouraging. To the best of our knowledge, our systems are the first broad-domain machine translation systems for Hebrew built anywhere.

As the Hebrew-English rule-based systems use the same MT framework as the Mapudungun and Quechua rule-based systems described earlier in this paper, the Hebrew systems consist of the same major components: modules to handle source language morphology analysis and target language morphology generation, a translation lexicon for the language pair, and a collection of transfer rules that perform the translation itself. In addition, the Hebrew orthographic and encoding conventions require some preprocessing. And in the final stage of translation, the different possible English translations are fed into a decoder, which uses a language model of English to search for and select a combination of sequential translation segments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5. Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows Encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a Romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.
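The following sketch illustrates the idea of this mapping. The partial transliteration table is illustrative, not the project's actual table; it is chosen so that the letters of the running example B$WRH come out as in the text:

# Decode Windows Hebrew (cp1255) and map letters to the romanized ASCII form.
TRANSLIT = {
    "א": "A", "ב": "B", "ג": "G", "ד": "D", "ה": "H", "ו": "W",
    "ר": "R", "ש": "$", "ת": "T", "ל": "L", "מ": "M", "ם": "M",
}

def preprocess(raw_bytes):
    text = raw_bytes.decode("cp1255")   # Microsoft Windows Hebrew encoding
    return "".join(TRANSLIT.get(ch, ch) for ch in text)

print(preprocess("בשורה".encode("cp1255")))   # -> B$WRH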

2.6. Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which led us to build morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew was available, it is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the lexemes is associated with a feature structure, which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic H ("her") as a suffix. Such a form would yield four different sequences of lexeme tokens, which will all be stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice Representation of a set of Analyses for the Hebrew Word B$WRH

Y0 ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1 ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2 ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3 ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4 ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5 ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6 ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7 ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-Structure Representation of a set of Analyses for the Hebrew Word B$WRH
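A minimal sketch of this data structure follows: the analyses of Figure 3 are held as arcs over character positions, and complete analyses are read off as arc sequences from the start of the word to its end. Feature structures are abbreviated to a part-of-speech tag here:

# Each arc: (label, SPANSTART, SPANEND, LEX, POS), following Figure 3.
arcs = [
    ("Y0", 0, 4, "B$WRH", "N"),    # "gospel"
    ("Y1", 0, 2, "B", "PREP"),     # preposition with hidden definite article
    ("Y2", 1, 3, "$WR", "N"),      # "bull"
    ("Y3", 3, 4, "$LH", "POSS"),   # possessive clitic
    ("Y4", 0, 1, "B", "PREP"),
    ("Y5", 1, 2, "H", "DET"),
    ("Y6", 2, 4, "$WRH", "N"),     # "line"
    ("Y7", 0, 4, "B$WRH", "LEX"),  # unknown-word entry (no features)
]

def paths(start, goal, prefix=()):
    # Yield every lexeme sequence spanning the whole word.
    if start == goal:
        yield prefix
    for label, s, e, lex, pos in arcs:
        if s == start:
            yield from paths(e, goal, prefix + (lex,))

for p in paths(0, 4):
    print(" ".join(p))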

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7. Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.
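A minimal sketch of the coverage-extension step (the entries shown are illustrative, not from the Dahan dictionary):

heb_to_eng = {"$LWM": ["peace", "hello"]}
eng_to_heb = {"peace": ["$LWM"], "letter": ["MKTB"]}

combined = {h: set(es) for h, es in heb_to_eng.items()}
for eng, hebs in eng_to_heb.items():       # invert the English-Hebrew section
    for heb in hebs:
        combined.setdefault(heb, set()).add(eng)

print(sorted(combined["MKTB"]))   # ['letter'] -- gained from the inverse side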

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer rule format expected by our transfer engine. A small number of lexical rules (currently 20) which require a richer set of unification feature constraints are appended after this conversion.

2.8. English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At runtime, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
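A minimal sketch of the enhancement process (the Hebrew-English pairs and the naive suffixation are illustrative; the real process must also handle irregular English forms):

base_entries = [("SPR", "book", "N"), ("$LX", "send", "V")]

def enhance(entries):
    out = []
    for heb, eng, pos in entries:
        out.append((heb, eng, pos, {}))                    # base form
        if pos == "N":
            out.append((heb, eng + "s", pos, {"num": "pl"}))
        elif pos == "V":
            out.append((heb, eng + "ed", pos, {"tense": "past"}))
            out.append((heb, eng + "s", pos, {"tense": "pres", "person": "3sg"}))
            out.append((heb, "to " + eng, pos, {"form": "inf"}))
    return out

for entry in enhance(base_entries):
    print(entry)

At run-time, the feature constraint in the fourth field must unify with the features of the analyzed Hebrew word for the entry to be usable.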

2.9. Two Hebrew-English Transfer Grammars

{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
;;Score: 2
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
;;Score: 4
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Figure 4: NP Transfer Rules for Nouns Modified by Adjectives from Hebrew to English

The transfer engine was designed to support both manually-developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker, who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Figure 5: Select Translated Sentences from the Development Data


System            BLEU                       NIST                       Precision   Recall
No Grammar        0.0606 [0.0599, 0.0612]    3.4176 [3.4080, 3.4272]    0.3830      0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]    3.5397 [3.5296, 3.5498]    0.3938      0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]    3.7850 [3.7733, 3.7966]    0.4085      0.4241

Table 1: System Performance Results with the Various Grammars (bracketed ranges are confidence intervals)

2.10. Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
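A minimal sketch of the bootstrapping step (per-sentence scores here are random stand-ins; in our evaluation the resampled statistic is the corpus-level BLEU or NIST score):

import random

def bootstrap_ci(scores, rounds=1000, alpha=0.05):
    # Resample the test set with replacement and read off the middle 95%.
    n = len(scores)
    stats = sorted(sum(random.choices(scores, k=n)) / n for _ in range(rounds))
    return stats[int(rounds * alpha / 2)], stats[int(rounds * (1 - alpha / 2)) - 1]

sentence_scores = [random.random() for _ in range(62)]   # 62 test sentences
print(bootstrap_ci(sentence_scores))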


2.11. Hindi

As of 2003, little enough natural language processing work had been done for Hindi that DARPA designated it as the target language in a task designed to build MT systems for languages with few resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation, and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

1. INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years and have significantly improved the state-of-the-art of machine translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or in short the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2002]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sheremetyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1. Architecture of the Xfer MT System and its Major Components

2. TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1. Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number, agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2. Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL):

- Type information: This identifies the type of the transfer rule and in most cases corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP [V V Aux] -> [Aux being V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat" in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as Det N P and Det Adj N P can be re-written more generally as NP P, as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
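The compositionality stage can be illustrated with a small sketch; the constituent inventory and rule shapes are simplified, and the real learner works over aligned bilingual rules rather than bare category sequences:

# Previously learned constituents, e.g. NP rules covering "Det N" and "Det Adj N".
learned = {("Det", "N"): "NP", ("Det", "Adj", "N"): "NP"}

def compose(flat_rule):
    best = list(flat_rule)
    # Try longer constituents first so "Det Adj N" wins over "Det N".
    for span, label in sorted(learned.items(), key=lambda kv: -len(kv[0])):
        k = len(span)
        i = 0
        while i + k <= len(best):
            if tuple(best[i:i + k]) == span:
                best[i:i + k] = [label]
            else:
                i += 1
    return best

print(compose(["Det", "Adj", "N", "P"]))   # ['NP', 'P'] -- now an expansion of PP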

2.3. The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].

2.4. Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability of $p(e \mid f)$, where $f = f_1 \ldots f_J$ is the source sentence and $e = e_1 \ldots e_I$ is the sequence of target language words. According to Bayes' decision rule, we have to search for:

$\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e) \quad (1)$

The language model $p(e)$ describes how well-formed a target language sentence $e$ is. We use a standard trigram model:

$p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1}) \quad (2)$

The translation probability $p(f \mid e)$ for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

$p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i) \quad (3)$

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence $s$ of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

$p(f \mid e) = \prod_{s} p(\tilde{f} \mid \tilde{e}) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i}) \quad (4)$
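Equation (3) can be made concrete with a small worked sketch (the lexicon probabilities are invented):

from math import prod   # Python 3.8+

lexicon = {("f1", "e1"): 0.6, ("f1", "e2"): 0.1,
           ("f2", "e1"): 0.2, ("f2", "e2"): 0.5}

def phrase_prob(src_phrase, tgt_phrase):
    # Product over source words of the sum over target words (IBM-1 style).
    return prod(sum(lexicon.get((f, e), 1e-9) for e in tgt_phrase)
                for f in src_phrase)

print(phrase_prob(["f1", "f2"], ["e1", "e2"]))   # (0.6+0.1)*(0.2+0.5) = 0.49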

The search algorithm considers all possible sequences $s$ in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence, i.e., leaving a gap, translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution, i.e., longer movements are less likely than shorter ones, with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring compared to the best scoring hypothesis up to that point are pruned from further consideration.
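A minimal sketch of the core search over the lattice follows; it uses exhaustive recursion and a unigram stand-in for the trigram language model, and omits the reordering and beam pruning just described. All names and probabilities are invented:

LM = {"the": 0.05, "letters": 0.01, "are": 0.04, "sent": 0.01}

def lm_score(words):
    p = 1.0
    for w in words:
        p *= LM.get(w, 1e-4)
    return p

# Translation units over a 3-word input: (start, end, English words, unit prob).
units = [
    (0, 1, ["letters"], 0.8),
    (1, 3, ["are", "sent"], 0.5),
    (0, 3, ["the", "letters", "are", "sent"], 0.2),
]

def best_path(units, n, pos=0):
    # Pick adjoining, non-overlapping units covering positions pos..n.
    if pos == n:
        return 1.0, []
    best = (0.0, None)
    for s, e, words, tp in units:
        if s == pos:
            sub_p, sub_words = best_path(units, n, e)
            p = tp * lm_score(words) * sub_p
            if p > best[0]:
                best = (p, words + sub_words)
    return best

print(best_path(units, 3))   # picks 'letters' + 'are sent' over the long unit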

3. ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus that we extracted from the Penn Treebank (Marcus et al. 1992).

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way, we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4. HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

Description          Morpher          Xfer
past participle      (tam = yA)       (aspect = perf) (form = part)
present participle   (tam = wA)       (aspect = imperf) (form = part)
infinitive           (tam = nA)       (form = inf)
future               (tam = future)   (tense = fut)
subjunctive          (tam = subj)     (tense = subj)
root                 (tam = 0)        (form = root)

4.1. Morphology and Grammars

4.1.1. Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left-hand side of the figure shows the raw output of the morphology system. The right-hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
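The feature mapping of Table I amounts to a small lookup, sketched below; the wrapper function name is hypothetical, while the tam values and Xfer features come directly from the table:

TAM_MAP = {
    "yA":     {"aspect": "perf",   "form": "part"},   # past participle
    "wA":     {"aspect": "imperf", "form": "part"},   # present participle
    "nA":     {"form": "inf"},                        # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def map_features(morpher_feats):
    xfer = dict(morpher_feats)          # gender, number, etc. pass through
    tam = xfer.pop("tam", None)
    xfer.update(TAM_MAP.get(tam, {}))
    return xfer

print(map_features({"root": "raha", "tam": "yA", "num": "sg"}))
# -> {'root': 'raha', 'num': 'sg', 'aspect': 'perf', 'form': 'part'}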

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb-sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences covered by the rules span the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion (jIvana (life) ke (of) eka (one) aXyAya (chapter)):
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion (one chapter of life):
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules
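The (Xi::Yj) alignment pairs in these rules are what drive constituent reordering; the following is a minimal sketch of that mechanism, greatly simplified from the real transfer formalism (feature-constraint checking and lexical transfer are omitted):

# A sketch of alignment-driven reordering, as in the rules of Figs. 5 and 7.
def apply_alignments(x_constituents, alignments):
    """Reorder source constituents according to 1-based (Xi, Yj) pairs.

    For example, the PP rule [NP Postp] -> [Prep NP] corresponds to the
    alignment map {1: 2, 2: 1}.
    """
    y_side = [None] * len(alignments)
    for x_pos, y_pos in alignments.items():
        y_side[y_pos - 1] = x_constituents[x_pos - 1]
    return y_side

# Flipping a Hindi postpositional phrase into English prepositional order:
print(apply_alignments(["jIvana", "ke"], {1: 2, 2: 1}))  # ['ke', 'jIvana']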

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2 and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of each rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are infrequent, or are infrequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words, leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we 'enhanced' the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for the past tense, past participle, gerund (and continuous), as well as a future tense entry. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for the plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs.
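A minimal sketch of this enhancement step follows, assuming simple spelling-rule pluralization; the entry and constraint representations are our illustration, not the system's actual format:

# A sketch of the dictionary-enhancement step for nouns described above.
def pluralize(noun):
    """English plural by spelling rule (irregulars would come from a list)."""
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def enhance_noun_entry(hindi_root, english_noun):
    """Emit the original entry plus a plural entry with a NUM constraint."""
    return [
        (hindi_root, english_noun, {}),
        (hindi_root, pluralize(english_noun), {"num": "pl"}),
    ]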

With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23612 translation pairs.

To create an additional resource of high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.
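The bigram extraction itself is straightforward; a sketch, assuming whitespace-tokenized monolingual Hindi text in corpus_lines:

# A sketch of the frequent-bigram extraction described above.
from collections import Counter

def top_bigrams(corpus_lines, n=500):
    counts = Counter()
    for line in corpus_lines:
        tokens = line.split()
        counts.update(zip(tokens, tokens[1:]))
    return [bigram for bigram, _ in counts.most_common(n)]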

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of postposition words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice where each entry indicates which Hindi words the partial translation spans, and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
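A high-level sketch of the three passes follows; the lexicon, morpher, and grammar objects and their methods are assumptions standing in for the engine's actual components, whose API is not described here:

# A sketch of the three-pass lattice construction described above.
def build_lattice(hindi_words, lexicon, morpher, grammar):
    lattice = []  # entries: (start, end, english_fragment)

    # Pass 1: match full inflected forms, so phrasal entries can fire;
    # no grammar rules are applied in this pass.
    lattice += lexicon.match_phrases(hindi_words)

    # Pass 2: analyze each word, then apply grammar rules to root forms,
    # exploiting the inflected-form enhancements in the lexicon.
    analyses = [morpher.analyze(w) for w in hindi_words]
    lattice += grammar.apply(analyses, lexicon)

    # Pass 3: word-to-word matches on the original inflected forms,
    # again without grammar rules.
    lattice += lexicon.match_words(hindi_words)

    return lattice  # the decoder rescores this and picks the best path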

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of

bilingual training data was estimated to amount to about 50000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                  BLEU                NIST
EBMT                    0.058               4.22
SMT                     0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer + SMT)       0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

Fig. 9. Results by NIST score: NIST score as a function of the decoder reordering window (0-4) for MEMT (SMT+Xfer), Xfer-ManGram, Xfer-LearnGram, Xfer-NoGram, and SMT.

Fig. 10. Results by BLEU score: BLEU score as a function of the decoder reordering window (0-4) for the same five systems.

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder.

Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding: at any point in the decoding, arcs that have a start-position index of up to four words ahead are considered, and are allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (so a sentence can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10000 times. Median scores and 95% confidence intervals were calculated based on the resulting sets of scores.
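A sketch of this bootstrap procedure follows; the score argument stands for a BLEU or NIST computation over a drawn sample of (hypothesis, references) pairs, and the interface is our illustration:

# A sketch of bootstrap resampling for confidence intervals, as described above.
import random
import statistics

def bootstrap(test_pairs, score, iterations=10_000, seed=0):
    rng = random.Random(seed)
    n = len(test_pairs)  # 258 sentences in this experiment
    samples = []
    for _ in range(iterations):
        # Draw n sentences with replacement from the test set.
        draw = [test_pairs[rng.randrange(n)] for _ in range(n)]
        samples.append(score(draw))
    samples.sort()
    median = statistics.median(samples)
    lo = samples[int(0.025 * iterations)]
    hi = samples[int(0.975 * iterations)]
    return median, (lo, hi)  # median score and 95% confidence interval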

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different reordering windows in the decoder on the different systems. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned

grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is not statistically very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best-performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html

The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi

Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm

The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html

Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.

Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.

Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.

Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.

Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).

Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.

Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).

Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.

Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.

Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.

Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).

Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.

Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.

Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).

Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).

Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.

Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3 Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon and improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target-language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

In addition to Hindi, Hebrew, Mapudungun, and Quechua, the languages contrasted in this paper, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4 Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5 References

Allen, James. (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramatica Quechua. Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine translation systems for indigenous languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna, and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish user studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert, and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD Thesis, Carnegie Mellon University.

Probst, Katharina, and Alon Lavie. (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley, and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review 21(2):113-138.

Wintner, Shuly, and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX workshop on Machine Translation for Semitic Languages, New Orleans, September.


ments that together represent the most likely translation of the entire input sentence. We now describe each of the components in more detail.

2.5 Hebrew Input Pre-processing

Our Hebrew-English MT systems are currently designed to process Hebrew input represented in Microsoft Windows Encoding. While this is the most common encoding in use for electronic Hebrew documents (such as online versions of newspapers), other encodings for Hebrew (such as UTF-8) are also quite common. The morphological analyzer we use (see next sub-section) was designed, however, to expect Hebrew in a romanized (ASCII) representation. We adopted this romanized form for all internal processing within our systems, including the encoding of Hebrew in the lexicon and in the transfer rules. The same romanized transliteration is used for Hebrew throughout this paper. The main task of our pre-processing module is therefore to map the encoding of the Hebrew input to its romanized equivalent. This should allow us to easily support other encodings of Hebrew input in the future. The pre-processing also includes simple treatment of punctuation and special characters.

2.6 Hebrew Morphological Analysis

To analyze the morphology of Hebrew input words, we used a publicly available morphological analyzer (Segal, 1999), which produces all the possible analyses of each input word. Analyses include the lexeme and a list of morpho-syntactic features such as number, gender, person, tense, etc. The analyzer also identifies prefix particles which are attached to the word. Having a morphological analyzer available for Hebrew enabled us to create an MT system with quite reasonable coverage. By contrast, other languages discussed in this paper, such as Mapudungun and Quechua, did not have easily available morphological analyzers, which resulted in our building morphological analyzers for those languages within the AVENUE project. But since our focus in the AVENUE project was not morphology alone, the morphological analyzers we built in-house have comparatively poor coverage.

While a morphological analyzer for Hebrew is helpfully available, the Hebrew analyzer is only available to us as a "black-box" component, which does not permit us to easily extend its lexical coverage or to incorporate spelling variants into the analyzer's lexicon. We have not addressed either of these problems so far. Our experiments with development data indicate that, at least for newspaper texts, the overall coverage of the analyzer is in fact quite reasonable. The texts we have used so far do not exhibit large amounts of vowel spelling variation, but we have not quantified the magnitude of the problem very precisely.

While the set of possible analyses for each input word comes directly from the analyzer, we developed a novel representation for this set to support its efficient processing through our translation system. The main issue addressed is that the analyzer may split an input word into a sequence of several output lexemes, by separating prefix and suffix lexemes. Moreover, different analyses of the same input word may result in a different number of output lexemes. We deal with this issue by converting our set of word analyses into a lattice that represents the various sequences of possible lexemes for the word. Each of the

lexemes is associated with a feature structure, which encodes the relevant morpho-syntactic features that were returned by the analyzer.

As an example, consider the word form B$WRH, which can be analyzed in at least four ways: the noun B$WRH ("gospel"); the noun $WRH ("line") prefixed by the preposition B ("in"); the same noun prefixed by the same preposition and a hidden definite article (merged with the preposition); and the noun $WR ("bull") with the preposition B as a prefix and an attached pronominal possessive clitic, H ("her"), as a suffix. Such a form would yield four different sequences of lexeme tokens, which are all stored in the lattice. To overcome the limited lexicon, and in particular the lack of proper nouns, we also consider each word form in the input as an unknown word and add it to the lattice with no features. This facilitates support of proper nouns through the translation dictionary. Figure 2 graphically depicts the lattice representation of the various analyses, and Figure 3 shows the feature-structure representation of the same analyses.

B$WRH
B $WRH
B H $WRH
B $WR H

Figure 2: Lattice Representation of a set of Analyses for the Hebrew Word B$WRH

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Figure 3: Feature-Structure Representation of a set of Analyses for the Hebrew Word B$WRH
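The arc-and-span encoding of Figures 2 and 3 can be sketched as follows; the Arc type is our illustration, and the feature structures are abbreviated relative to Figure 3:

# A sketch of the lattice encoding: each analysis is a sequence of arcs
# over span positions of the input word, following Figures 2 and 3.
from dataclasses import dataclass

@dataclass
class Arc:
    start: int   # SPANSTART
    end: int     # SPANEND
    lex: str
    features: dict

def analyses_for_bswrh():
    """The four analyses of B$WRH, plus the unknown-word fallback."""
    return [
        [Arc(0, 4, "B$WRH", {"pos": "N", "gen": "F", "num": "S"})],   # "gospel"
        [Arc(0, 2, "B", {"pos": "PREP"}),
         Arc(2, 4, "$WRH", {"pos": "N", "gen": "F", "num": "S"})],    # "in line"
        [Arc(0, 1, "B", {"pos": "PREP"}), Arc(1, 2, "H", {"pos": "DET"}),
         Arc(2, 4, "$WRH", {"pos": "N", "gen": "F", "num": "S"})],    # "in the line"
        [Arc(0, 1, "B", {"pos": "PREP"}),
         Arc(1, 3, "$WR", {"pos": "N", "gen": "M", "num": "S"}),
         Arc(3, 4, "$LH", {"pos": "POSS"})],                          # "in her bull"
        [Arc(0, 4, "B$WRH", {})],  # unknown-word entry, no features
    ]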

While the analyzer of Segal (1999) comes with a disambiguation option, its reliability is limited. We prefer to store all the possible analyses of the input in the lattice rather than disambiguate, since the AVENUE transfer engine can cope with a high degree of ambiguity, and information accumulated in the translation process can assist in ambiguity resolution later on, during the decoding stage. A ranking of the different analyses of each word could, however, be very useful. For example, the Hebrew word form AT can be either the (highly frequent) definite accusative marker, the (less frequent) second person


feminine personal pronoun, or the (extremely rare) noun "spade". We currently give all these readings the same weight, although we intend to rank them in the future.

2.7 Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed based on the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer-rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.

2.8 English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (3rd person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms to English inflected forms. At runtime, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
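A minimal sketch of this run-time unification step follows, using a flat feature representation as a simplification of the system's actual feature structures:

# A sketch of feature unification for selecting among inflected entries.
def unify(analysis_features, rule_constraints):
    """Succeed iff no constraint conflicts with the analyzed features."""
    merged = dict(analysis_features)
    for key, value in rule_constraints.items():
        if key in merged and merged[key] != value:
            return None  # inconsistent: prune this candidate rule
        merged[key] = value
    return merged

# Candidate lexical entries for one Hebrew base form:
candidates = [
    ("dress",   {"num": "s"}),
    ("dresses", {"num": "pl"}),
]
analyzed = {"num": "pl", "gen": "f"}  # from the morphological analyzer
selected = [word for word, c in candidates if unify(analyzed, c)]
print(selected)  # ['dresses']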

2.9 Two Hebrew-English Transfer Grammars

{NP1,2}                                  {NP1,3}
SL: $MLH ADWMH                           SL: H $MLWT H ADWMWT
TL: A RED DRESS                          TL: THE RED DRESSES
Score: 2                                 Score: 4

NP1::NP1 [NP1 ADJ] -> [ADJ NP1]          NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X2::Y1)                                ((X3::Y1)
 (X1::Y2)                                 (X1::Y2)
 ((X1 def) = -)                           ((X1 def) = +)
 ((X1 status) =c absolute)                ((X1 status) =c absolute)
 ((X1 num) = (X2 num))                    ((X1 num) = (X3 num))
 ((X1 gen) = (X2 gen))                    ((X1 gen) = (X3 gen))
 (X0 = X1))                               (X0 = X1))

Figure 4: NP Transfer Rules for Nouns Modified by Adjectives from Hebrew to English

The transfer engine was designed to support both manually developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results when using this automatically acquired grammar are reported in Section 4.

As part of our two-month effort so far, we developed a preliminary small manual transfer grammar, which reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 4, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker, who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages, and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for them flight tickets for their countries from their money

Figure 5: Select Translated Sentences from the Development Data


System            BLEU                      NIST                      Precision   Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830      0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938      0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085      0.4241

Table 1: System Performance Results with the Various Grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input-representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also include aggregate unigram precision and unigram recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we apply a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.

Abstract

We describe the rapid development of a preliminary Hebrew-to-English Machine Translation system, under a transfer-based framework specifically designed for rapid MT prototyping for languages with limited linguistic resources. The task is particularly challenging for two main reasons: the high lexical and morphological ambiguity of Hebrew, and the dearth of available resources for the language. Existing publicly available resources were adapted in novel ways to support the MT task. The methodology behind the system combines two separate modules: a transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to an English language model. We demonstrate that a small, manually crafted set of transfer rules suffices to produce legible translations. Performance results are evaluated using state-of-the-art measures and are shown to be encouraging.

2.11 Hindi

Little enough natural language processing work had been done with Hindi in YYYY for DARPA to designate Hindi as the language to build MT systems for, in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than with the current automatically inferred rules. Furthermore, a multi-engine version of our system, which combined the output of the Xfer and SMT systems and optimizes translation selection, outperformed both individual systems.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing - Machine Translation

General Terms: Evaluation, Hindi, Machine Translation

Additional Key Words and Phrases: Example-based Machine Translation, Limited Data Resources, Machine Learning, Multi-Engine Machine Translation, Statistical Translation, Transfer Rules

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years,


and have signi_cantly improved the state-of-the-art of Machine Translation for a number of di_erent language pairs These approaches are attractive because they are fully auto-mated and require orders of magnitude less human labor than traditional rule-based MT approaches However to achieve reasonable levels of translation performance the corpus-based methods require very large volumes of sentence-aligned parallel text for the two lan-guages on the order of magnitude of a mil-lion words or more Such resources are only currently available for only a small number of language pairs While the amount of online re-sources for many languages will undoubtedly grow over time many of the languages spo-ken by smaller ethnic groups and populations in the world will not have such resources within the forseeable future Corpus-based MT approaches will therefore not be e_ective for such languages for some time to comeOur MT research group at Carnegie Mellon under DARPA and NSF funding has been working on a new MT approach that is speci_cally designed to enable rapid develop-ment of MT for languages with limited amounts of online resources Our approach assumes the availability of a small number of bilingual speakers of the two languages but these need not be linguistic experts The bi-lingual speakerscreate a comparatively small corpus of word aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool From this data the learning module of our system automatically infers hierarchical syntactic transfer rules which encode how constituent structures in the source language transfer to the target language The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language We re-fer to this system as the ldquoTrainableTransfer-based MT System or in short the Xfer systemThe DARPA-sponsored Surprise Language Ex-ercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach The Hindi-to-English system that we developed in the course of this exer-cise was the _rst large-scale open-domain test for our system Our goal was to compare the performance of our Xfer system with the cor-pus-based SMT and EBMT approaches devel-oped both within our group at Carnegie Mel-lon and by our colleagues elsewhere Com-mon training and testing data were used for the purpose of this cross-system comparison The data was collected throughout the SLE during the month of June 2003 As data re-

sources became available it became clear that the Hindi-to-English was in fact not a ldquolimited-data situation By the end of the SLE over 15 million words of parallel Hindi-English text had been collected which was su_cient for development of basic-quality SMT and EBMT systems In a common evaluation con-ducted at the end of the SLE the SMT systems that participated in the evaluation outper-formed our Xfer system as measured by the NIST automatic MT evaluation metric [Dod-dington 2003] Our system received a NIST score of 547 as compared with the best SMT system which received a score of 761Our intention however was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg1998 Sherematyeva and Nirenburg 2000 Jones and Havrilla 1998]) We therefore de-signed an ldquoarti_cial extremely limited data scenario where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE We then de-signed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario The design execution and re-sults of this experiment are the focus of this paper The results of the experiment indicate that under these extremely limited training data conditions when tested on unseen data the Xfer system signi_cantly outperforms both EBMT and SMTSeveral di_erent versions of the Xfer system were tested Results indicated that automati-cally learned transfer rules are e_ective in im-proving translation performance compared with a baseline word-to-word translation ver-sion of our system System performance with a limited number of manually written transfer rules was however still better than the cur-rent automatically inferred rules Furthermore aldquomulti-engine version of our system that combined the output of the Xfer and SMT sys-tems and optimizes translation selection out-performed both individual systemsThe remainder of this paper is organized as follows Section 2 presents an overview of the Xfer system and its components Section 3 de-scribes the elicited data collection for Hindi-English that we conducted during the SLE which provided the bulk of training data for our limited data experiment Section 4 de-scribes the speci_c resources and components that were incorporated into our Hindi-to-Eng-lish Xfer system Section 5 then describes the controlled experiment for comparing the Xfer EBMT and SMT systems under the limited data

scenario and the results of this experiment Finally Section 6 describes our conclusions and future research directions

Fig 1 Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from 'traditional' transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the 'right-hand sides' of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP : [V V Aux] -> [Aux "being" V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Fig 3 A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages: the first phase, Seed Generation, produces initial 'guesses' at transfer rules. The rules that result from Seed Generation are 'flat' in that they specify a sequence of parts of speech, and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as Det N P and Det Adj N P can be re-written more generally as NP P, as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
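To make the role of these unification constraints concrete, here is a minimal illustrative sketch, in Python, of how feature constraints can prune candidate rules. It is our own reconstruction, not the AVENUE implementation, and all names and feature values in it are hypothetical.

def satisfies_constraints(input_features, rule_constraints):
    # A rule applies only if every feature it constrains is consistent
    # with the features of the morphologically analyzed input.
    for feature, required in rule_constraints.items():
        if feature in input_features and input_features[feature] != required:
            return False  # inconsistent value: unification fails
    return True

# A plural Hindi noun, as a morphological analyzer might describe it.
analyzed_word = {"pos": "N", "num": "pl", "gen": "m"}

# Two hypothetical candidate lexical transfer rules for the same root.
candidates = [
    {"english": "boy",  "constraints": {"pos": "N", "num": "sg"}},
    {"english": "boys", "constraints": {"pos": "N", "num": "pl"}},
]

surviving = [c["english"] for c in candidates
             if satisfies_constraints(analyzed_word, c["constraints"])]
print(surviving)  # ['boys']

The same mechanism underlies the treatment of inflected lexical entries described in Section 4.2: only entries whose constraints unify with the analyzed input survive.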

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining, but non-overlapping, translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e) \qquad (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1}) \qquad (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i) \qquad (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_{s} p(\tilde{f} \mid \tilde{e}) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i}) \qquad (4)

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring, compared to the best scoring hypothesis up to that point, are pruned from further consideration.
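The following sketch, in Python, ties Equations (1) through (4) together by scoring a single candidate sequence of lattice units. It is a minimal illustration under stated assumptions: the trigram model and the IBM-1 lexicon are given as plain dictionaries, smoothing is a crude floor value, and the beam search and reordering described above are omitted; all names are our own.

import math

def lm_logprob(words, trigram):
    # Trigram language model log-probability of an English string, Eq. (2).
    padded = ["<s>", "<s>"] + words
    return sum(math.log(trigram.get(tuple(padded[i-2:i+1]), 1e-10))
               for i in range(2, len(padded)))

def unit_logprob(f_phrase, e_phrase, lexicon):
    # IBM-1 style phrase translation log-probability, Eq. (3): a product
    # over source words of a sum over target words.
    return sum(math.log(sum(lexicon.get((f, e), 1e-10) for e in e_phrase))
               for f in f_phrase)

def sequence_logprob(units, trigram, lexicon):
    # Log of p(f|e) * p(e) for one sequence of non-overlapping lattice
    # units (f_phrase, e_phrase) covering the source sentence, Eqs. (1), (4).
    english = [w for _, e_phrase in units for w in e_phrase]
    tm = sum(unit_logprob(f, e, lexicon) for f, e in units)
    return tm + lm_logprob(english, trigram)

A full decoder would evaluate every admissible sequence in the lattice with a score of this form and return the highest-scoring one, per Equation (1).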

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus ([Fra]) that we extracted from the Penn Treebank ([Tre]).

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way, we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural, even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description            Morpher          Xfer
past participle        (tam = yA)       (aspect = perf) (form = part)
present participle     (tam = wA)       (aspect = imperf) (form = part)
infinitive             (tam = nA)       (form = inf)
future                 (tam = future)   (tense = fut)
subjunctive            (tam = subj)     (tense = subj)
root                   (tam = 0)        (form = root)
Table I Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left-hand side of the figure shows the raw output of the morphology system. The right-hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
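To make the feature mapping of Table I concrete, here is a small Python sketch of the conversion. The correspondences themselves come from Table I; the dictionary and function names are ours, and this is not the project's actual wrapper code.

TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},  # past participle
    "wA":     {"aspect": "imperf", "form": "part"},  # present participle
    "nA":     {"form": "inf"},                       # infinitive
    "future": {"tense": "fut"},                      # future
    "subj":   {"tense": "subj"},                     # subjunctive
    "0":      {"form": "root"},                      # root
}

def convert_morpher_output(morpher_features):
    # Replace a Morpher 'tam' feature with the equivalent Xfer
    # tense/aspect/form features, keeping all other features.
    converted = {k: v for k, v in morpher_features.items() if k != "tam"}
    converted.update(TAM_TO_XFER.get(morpher_features.get("tam"), {}))
    return converted

print(convert_morpher_output({"root": "rah", "tam": "yA", "gen": "m"}))
# {'root': 'rah', 'gen': 'm', 'aspect': 'perf', 'form': 'part'}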

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig 5 Recursive NP Rules

Hindi NP with left recursion (jIvana (life) ke (of) eka (one) aXyAya (chapter)):
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion (one chapter of life):
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig 6 Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig 7 Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases, and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
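A minimal Python sketch of this frequency-based pruning follows; the threshold and the rule counts below are illustrative assumptions, since the exact cutoff used to reduce the grammar to 16 rules is not specified here.

def prune_rules(rule_counts, threshold=0.01):
    # rule_counts maps a rule identifier to the number of training
    # examples that produced it; keep rules whose relative frequency
    # meets the threshold.
    total = sum(rule_counts.values())
    return {rule: count / total
            for rule, count in rule_counts.items()
            if count / total >= threshold}

learned = {"NP,12": 412, "PP,12": 306, "NP,87": 2, "PP,45": 1}
print(prune_rules(learned))  # the two poorly supported rules are dropped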

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we 'enhanced' the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule, indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense entries. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules (a schematic example of the dictionary enhancement appears after the list below):

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
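As an aside on the dictionary enhancement described earlier in this section, the following Python sketch shows one way root-form noun entries could be expanded with an English plural and a number constraint via spelling rules. It is an illustrative reconstruction under our own naming; the spelling rules and the irregulars list shown are a small subset.

def english_plural(noun):
    # Approximate pluralization by spelling rules: 'es' after sibilants,
    # 'y' -> 'ies' after a consonant, plain 's' otherwise; irregular
    # forms come from a (here abbreviated) word list.
    irregular = {"child": "children", "man": "men"}
    if noun in irregular:
        return irregular[noun]
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def enhance_noun_entry(hindi_root, english_root):
    # Return the original root-form entry plus a plural entry carrying
    # the (num = pl) constraint used during run-time unification.
    return [
        (hindi_root, english_root, {}),
        (hindi_root, english_plural(english_root), {"num": "pl"}),
    ]

print(enhance_noun_entry("ladakA", "boy"))
# [('ladakA', 'boy', {}), ('ladakA', 'boys', {'num': 'pl'})]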

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module, to obtain the root forms along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations, without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
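Schematically, the three passes could be organized as in the Python sketch below. The lexicon, grammar, and morphology interfaces are hypothetical stand-ins; only the control flow described above is intended to be accurate.

def build_lattice(sentence, lexicon, grammar, morpher):
    words = sentence.split()
    arcs = []  # lattice entries: (start index, end index, translation)
    # Pass 1: match multi-word spans of the sentence, in fully inflected
    # form, against the lexicon so phrasal entries fire; no grammar rules.
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):
            phrase = " ".join(words[i:j])
            for translation in lexicon.get(phrase, []):
                arcs.append((i, j, translation))
    # Pass 2: reduce each word to its root form plus inflectional
    # features, then apply the transfer grammar over the analyzed input.
    analyzed = [morpher(w) for w in words]
    arcs.extend(grammar.apply(analyzed))
    # Pass 3: match the original inflected words one by one against the
    # dictionary for word-to-word translations, again without grammar.
    for i, w in enumerate(words):
        for translation in lexicon.get(w, []):
            arcs.append((i, i + 1, translation))
    return arcs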

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig 8 Example of a Test Sentence and Reference Translations

System                   BLEU                NIST
EBMT                     0.058               4.22
SMT                      0.102 (± 0.016)     4.70 (± 0.20)
Xfer no grammar          0.109 (± 0.015)     5.29 (± 0.19)
Xfer learned grammar     0.112 (± 0.016)     5.32 (± 0.19)
Xfer manual grammar      0.135 (± 0.018)     5.59 (± 0.20)
MEMT (Xfer + SMT)        0.136 (± 0.018)     5.65 (± 0.21)

Table II System Performance Results for the Various Translation Approaches

[Figure: NIST score as a function of the decoder reordering window, for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig 9 Results by NIST score

[Figure: BLEU score as a function of the decoder reordering window, for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig 10 Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string, according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the Bleu score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and Bleu scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
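A minimal Python sketch of this bootstrap procedure follows; score_fn stands in for BLEU or NIST computed over a set of test sentences, and the seed is ours.

import random

def bootstrap_ci(sentences, score_fn, iterations=10000, seed=0):
    # Resample the test set with replacement, rescore each resampled
    # set, and read the median and 95% interval off the distribution.
    rng = random.Random(seed)
    scores = sorted(score_fn([rng.choice(sentences) for _ in sentences])
                    for _ in range(iterations))
    return (scores[len(scores) // 2],          # median
            scores[int(0.025 * len(scores))],  # lower 95% bound
            scores[int(0.975 * len(scores))])  # upper 95% bound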

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for Bleu), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over" point: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice, and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system in its current state outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram cooccurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3 Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool, to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translation produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

In addition to Hindi, Hebrew, Mapudungun, and Quechua, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4 Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121631.

The research reported in the Hebrew sections of this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James. (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. (2002). Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Cusihuaman, Antonio. (2001). Gramática Quechua, Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Dahan, Hiya. (1997). Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. (2002). Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. (1986). Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association for Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311-318, Philadelphia, PA, July.

Peterson, Erik. (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. (2002). MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 Workshop at MT Summit VIII.

Segal, Erel. (1999). Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. (2004). Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona. (2003). Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


…the feminine personal pronoun, or the (extremely rare) noun "spade." We currently give all these readings the same weight, although we intend to rank them in the future.

2.7. Word Translation Lexicon

The status of the Hebrew-English translation lexicon is similar to that of the Hebrew morphological analyzer: a translation lexicon for Hebrew is freely available to us, but it is not perfect. The Hebrew-English bilingual word translation lexicon was constructed from the Dahan dictionary (Dahan, 1997), whose main benefit is that we were able to obtain it in machine-readable form. This is a relatively low-quality, low-coverage dictionary. To extend its coverage, we use both the Hebrew-English section of the dictionary and the inverse of the English-Hebrew section. The combined lexicon was enhanced with a small manual lexicon of about 100 entries, containing some inflected forms not covered by the morphological analyzer and common multi-word phrases whose translations are non-compositional.
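The combination of sources just described can be sketched as follows. This is a minimal illustration rather than the project's actual code; the romanized forms $MLH and ADWMH come from Figure 4 below, while BIT SPR is a hypothetical multi-word manual entry.

# A minimal sketch (not the project's actual code) of combining both
# directions of a bilingual dictionary with a small manual lexicon.
# The input pair lists are hypothetical stand-ins for the Dahan sections.

def build_translation_lexicon(dahan_he_en, dahan_en_he, manual_entries):
    """Return a Hebrew->English lexicon as {hebrew: set(english)}."""
    lexicon = {}

    def add(he, en):
        lexicon.setdefault(he, set()).add(en)

    # Hebrew->English section, used as-is.
    for he, en in dahan_he_en:
        add(he, en)
    # English->Hebrew section, inverted to extend coverage.
    for en, he in dahan_en_he:
        add(he, en)
    # Small manual lexicon: inflected forms and non-compositional phrases.
    for he, en in manual_entries:
        add(he, en)
    return lexicon

lex = build_translation_lexicon(
    [("$MLH", "dress")],        # Hebrew->English pairs (romanized)
    [("red", "ADWMH")],         # English->Hebrew pairs, inverted above
    [("BIT SPR", "school")],    # hypothetical manual multi-word entry
)
print(lex)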

Significant work was required to ensure spelling-variant compatibility between the lexicon and the other resources in our system. The original Dahan dictionary uses the dotted Hebrew spelling representation. We developed scripts for automatically mapping the original forms in the dictionary into romanized forms consistent with the undotted spelling representation. These handle most, but not all, of the mismatches. Due to the low quality of the dictionary, a fair number of entries require some manual editing. This primarily involves removing incorrect or awkward translations and adding common missing translations. Due to the very rapid system development time, most of the editing done so far was based on a small set of development sentences. Undoubtedly, the dictionary is one of the main bottlenecks of our system, and a better dictionary will improve the results significantly. The final resulting translation lexicon is automatically converted into the lexical transfer-rule format expected by our transfer engine. A small number of lexical rules (currently 20) that require a richer set of unification feature constraints are appended after this conversion.

2.8. English Morphological Generation

Since the transfer system currently does not use a morphological generator on the target (English) side, we use a dictionary enhancement process to generate the various morphological forms of English words, with appropriate constraints on when these forms should be used. The enhancement process works as follows: for each English noun in the lexicon, we create an additional entry with the Hebrew word unchanged and the English word in the plural form. We also add a constraint to the lexical rule indicating that the number is plural. A similar strategy is applied to verbs: we add entries (with constraints) for past tense, future tense, present tense (third person singular), and infinitive forms. The result is a set of lexical entries associating Hebrew base forms with English inflected forms. At run-time, the features associated with the analyzed Hebrew word form are unified with the constraints in the lexical rules, so that the appropriate form is selected while the other forms fail to unify.
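A minimal sketch of this enhancement step, assuming regular English inflection only; the romanizations $MLH and $LX are hypothetical stand-ins for real dictionary entries, and the actual process also handled spelling variation and irregular forms.

# A simplified sketch of the lexicon enhancement: for each base-form entry,
# emit additional lexical rules pairing the unchanged Hebrew word with an
# inflected English form plus a feature constraint selecting that form.

def enhance_entry(hebrew, english, pos):
    rules = [(hebrew, english, pos, {})]          # base entry, unconstrained
    if pos == "N":
        rules.append((hebrew, english + "s", pos, {"num": "pl"}))
    elif pos == "V":
        rules += [
            (hebrew, english + "ed", pos, {"tense": "past"}),
            (hebrew, "will " + english, pos, {"tense": "fut"}),
            (hebrew, english + "s", pos, {"tense": "pres", "person": "3", "num": "sg"}),
            (hebrew, "to " + english, pos, {"form": "inf"}),
        ]
    return rules

# At run-time the analyzed Hebrew word's features are unified against the
# constraint dict, so only the matching English form survives.
for rule in enhance_entry("$MLH", "dress", "N") + enhance_entry("$LX", "send", "V"):
    print(rule)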

2.9. Two Hebrew-English Transfer Grammars

{NP1,2}
SL: $MLH ADWMH
TL: A RED DRESS
Score: 2
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
(X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1)
)

{NP1,3}
SL: H $MLWT H ADWMWT
TL: THE RED DRESSES
Score: 4
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)

Figure 4: NP transfer rules for nouns modified by adjectives, from Hebrew to English

The transfer engine was designed to support both manually developed structural transfer grammars and grammars that can be automatically acquired from bilingual data. We used the currently available training algorithms for automatically learning a transfer grammar (Lavie et al., 2003) and applied them to a word-aligned, Hebrew-translated version of the elicitation corpus. Some preliminary performance results obtained with this automatically acquired grammar are reported in Section 2.10.

As part of our two-month effort, we developed a preliminary small manual transfer grammar that reflects the most common local syntactic differences between Hebrew and English. The current grammar contains a total of 36 rules, including 21 noun-phrase (NP) rules, one prepositional-phrase (PP) rule, 6 verb-complex and verb-phrase (VP) rules, and 8 higher-phrase and sentence-level rules for common Hebrew constructions. As we demonstrate in Section 2.10, this small set of transfer rules is already sufficient for producing reasonably legible translations in many cases. Grammar development took about two days of manual labor by a native bilingual speaker who is also a member of the system development team and is thus well familiar with the underlying formalism and its capabilities. Figure 4 contains an example of transfer rules for structurally transferring nouns modified by adjectives from Hebrew to English. The rules enforce number and gender agreement between the noun and the adjective. They also account for the different word order exhibited by the two languages and for the special location of the definite article in Hebrew noun phrases.

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Figure 5: Selected translated sentences from the development data


System            BLEU                      NIST                      Precision   Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830      0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938      0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085      0.4241

Table 1: System performance results with the various grammars

2.10. Results and Evaluation

The current system was developed over the course of a two-month period and was targeted at the translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and to stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and of modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few selected translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also report aggregate unigram precision and unigram recall as additional measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we applied a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual-grammar system outperforms the no-grammar system according to all the metrics. The learned-grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.
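As a rough illustration of the bootstrap procedure, the sketch below resamples the test set with replacement and recomputes a corpus-level score on each resample; corpus_metric is a stand-in for BLEU or NIST (a toy unigram-precision metric is used in the example).

import random

# A minimal sketch of bootstrap confidence intervals for an MT metric
# (Efron and Tibshirani, 1986). `corpus_metric` maps a list of
# (hypothesis, reference) pairs to a single corpus-level score.

def bootstrap_ci(test_pairs, corpus_metric, n_resamples=1000, alpha=0.05):
    scores = []
    for _ in range(n_resamples):
        sample = [random.choice(test_pairs) for _ in test_pairs]
        scores.append(corpus_metric(sample))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy usage with a dummy metric (aggregate unigram precision):
def unigram_precision(pairs):
    match = total = 0
    for hyp, ref in pairs:
        match += sum(1 for w in hyp.split() if w in ref.split())
        total += len(hyp.split())
    return match / max(total, 1)

pairs = [("the red dress", "a red dress"), ("dresses the red", "the red dresses")]
print(bootstrap_ci(pairs, unigram_precision, n_resamples=200))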

In summary, a preliminary Hebrew-to-English machine translation system was developed rapidly under a transfer-based framework specifically designed for rapid MT prototyping for languages with limited linguistic resources. The task is particularly challenging for two main reasons: the high lexical and morphological ambiguity of Hebrew, and the dearth of available resources for the language. Existing publicly available resources were adapted in novel ways to support the MT task. The system combines two separate modules: a transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches for and selects the most likely translation according to an English language model. A small, manually crafted set of transfer rules suffices to produce legible translations, and performance results, evaluated using state-of-the-art measures, are encouraging.

2.11. Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language for which to build MT systems, in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-Based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than that of the current automatically inferred rules. Furthermore, a multi-engine version of our system, which combines the output of the Xfer and SMT systems and optimizes translation selection, outperformed both individual systems.

1. INTRODUCTION

Corpus-based machine translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-Based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years.


These corpus-based approaches have significantly improved the state of the art of machine translation for a number of different language pairs. They are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or, in short, the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003.

As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2002]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial", extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimized translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Figure 1: Architecture of the Xfer MT system and its major components

2. TRAINABLE TRANSFER-BASED MT: OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish), for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1. Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high-quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number, agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules: for example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2. Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it infers syntactic transfer rules. The learning system also learns the composition of transfer rules: in the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all the information that is necessary for parsing, transfer, and generation. In this regard, they differ from "traditional" transfer rules, which exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows the SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the "right-hand sides" of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE, SIMPLE PRESENT}
VP::VP [V V Aux] -> [Aux being V]
(
(X1::Y3)
((x1 form) = part)
((x1 aspect) = perf)
((x2 form) = part)
((x2 aspect) = imperf)
((x2 lexwx) = jAnA)
((x3 lexwx) = honA)
((x3 tense) = pres)
((x0 tense) = (x3 tense))
(x0 = x1)
((y1 lex) = be)
((y1 tense) = pres)
((y3 form) = part)
)

Figure 3: A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat", in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be rewritten more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
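The following toy sketch illustrates just the compositionality step on the "Det Adj N P" example above; the rule notation is heavily simplified, omitting alignments and unification constraints.

# A toy sketch of the compositionality stage: if a previously learned rule
# derives NP from a contiguous part-of-speech subsequence, rewrite a flat
# rule's right-hand side to use the NP non-terminal instead.

learned_np_rhs = [("Det", "N"), ("Det", "Adj", "N")]  # RHSs of learned NP rules

def compose(flat_rhs, constituent="NP", known_rhs=learned_np_rhs):
    # Try longer learned RHSs first so the most specific match wins.
    for rhs in sorted(known_rhs, key=len, reverse=True):
        n = len(rhs)
        for i in range(len(flat_rhs) - n + 1):
            if tuple(flat_rhs[i:i + n]) == rhs:
                return flat_rhs[:i] + [constituent] + flat_rhs[i + n:]
    return flat_rhs

# "Det Adj N P" and "Det N P" both generalize to "NP P",
# a candidate expansion of PP in Hindi:
print(compose(["Det", "Adj", "N", "P"]))   # -> ['NP', 'P']
print(compose(["Det", "N", "P"]))          # -> ['NP', 'P']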

2.3. The Run-Time Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The run-time translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system. In the first stage, parsing is performed based solely on the x-sides of the transfer rules. The implemented parsing algorithm is, for the most part, a standard bottom-up chart parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the run-time transfer-based translation sub-system can be found in [Peterson 2002].
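The overall flow can be caricatured as follows. The chart, rule, and lexicon structures are simplified stand-ins, and real parsing is done with a bottom-up chart parser rather than the naive span matching used here.

# A schematic sketch of the runtime flow: match rule x-sides against spans
# of the input, map each chart entry to a TL string, and collect word-level
# lexical alternatives, producing a lattice for the statistical decoder.

def translate(sl_words, rules, lexicon):
    # Stage 1: "parse" -- find every rule whose x-side matches a span.
    chart = []
    for i in range(len(sl_words)):
        for j in range(i + 1, len(sl_words) + 1):
            for rule in rules:
                if rule["x_side"] == sl_words[i:j]:
                    chart.append((i, j, rule))
    # Stage 2: transfer + generation -- map each chart entry to TL strings.
    lattice = [(i, j, rule["y_side"]) for (i, j, rule) in chart]
    # Word-level lexical transfer seeds single-word alternatives.
    for i, w in enumerate(sl_words):
        for tl in lexicon.get(w, []):
            lattice.append((i, i + 1, tl))
    return lattice  # passed on to the statistical decoder

rules = [{"x_side": ["$MLH", "ADWMH"], "y_side": "a red dress"}]
lexicon = {"$MLH": ["dress"], "ADWMH": ["red"]}
print(translate(["$MLH", "ADWMH"], rules, lexicon))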

2.4. Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of the input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining, non-overlapping translation units that maximizes the probability of p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for:

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)   (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})   (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i)   (3)

where the product runs over all words in the source phrase and the sum runs over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_{s} p(\tilde{f}_s \mid \tilde{e}_s) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i})   (4)

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of reordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word reorderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low-scoring compared to the best-scoring hypothesis up to that point are pruned from further consideration.
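A brute-force sketch of the decoder objective in equations (1)-(4) is shown below. It enumerates complete segmentations of a toy lattice and scores each with a stand-in trigram language model and per-unit translation probabilities; the real decoder additionally performs reordering and beam pruning.

import math

# Lattice units are (start, end, tl_string, p_unit), where p_unit stands in
# for the IBM-1 phrase probability of equation (3).

def seqs(lattice, start, end):
    """All unit sequences that exactly cover [start, end)."""
    if start == end:
        yield []
        return
    for (i, j, tl, p_unit) in lattice:
        if i == start:
            for rest in seqs(lattice, j, end):
                yield [(i, j, tl, p_unit)] + rest

def decode(lattice, n_words, lm):
    best, best_score = None, -math.inf
    for seq in seqs(lattice, 0, n_words):
        words = " ".join(tl for (_, _, tl, _) in seq).split()
        # log p(f|e): product of per-unit translation probabilities (eq. 4)
        score = sum(math.log(p) for (_, _, _, p) in seq)
        # log p(e): trigram language model (eq. 2)
        hist = ["<s>", "<s>"]
        for w in words:
            score += math.log(lm(hist[-2], hist[-1], w))
            hist.append(w)
        if score > best_score:
            best, best_score = words, score
    return best

lattice = [(0, 2, "a red dress", 0.5), (0, 1, "dress", 0.4), (1, 2, "red", 0.6)]
lm = lambda u, v, w: 0.1   # uniform toy trigram model
print(decode(lattice, 2, lm))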

3. ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases do not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally. (A sketch of this extraction step appears at the end of this section.)

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they could not translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required to collect, translate, and align the elicited phrases based on a sample. The estimated time spent on translating and aligning one file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural, even for native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the run-time translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.
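The treebank extraction step referred to above can be approximated with off-the-shelf tools. The sketch below assumes the NLTK sample of the Penn Treebank is installed (nltk.download('treebank')); that sample contains WSJ rather than Brown text, but the mechanics are the same.

from nltk.corpus import treebank

# A rough sketch of the phrase-extraction step: pull NP subtrees out of
# parsed Penn Treebank trees and sort them by parse depth, so that simpler
# phrases come first, preserving compositional order.

def extract_nps(max_phrases=20):
    phrases = []
    for tree in treebank.parsed_sents():
        for subtree in tree.subtrees(lambda t: t.label() == "NP"):
            phrases.append((subtree.height(), " ".join(subtree.leaves())))
    phrases.sort(key=lambda pair: pair[0])   # shallow (simple) phrases first
    return phrases[:max_phrases]

for depth, np in extract_nps():
    print(depth, np)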

4. HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description           Morpher          Xfer
past participle       (tam = yA)       (aspect = perf) (form = part)
present participle    (tam = wA)       (aspect = imperf) (form = part)
infinitive            (tam = nA)       (form = inf)
future                (tam = future)   (tense = fut)
subjunctive           (tam = subj)     (tense = subj)
root                  (tam = 0)        (form = root)

Table I: Tense, aspect, and mood features for Morpher and Xfer

4.1. Morphology and Grammars

4.1.1. Morphology

The morphology module used by the run-time Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encodings as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left-hand side of the figure shows the raw output of the morphology system. The right-hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
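A hypothetical sketch of such a wrapper follows; the socket protocol and the tiny transliteration table are illustrative placeholders, not the actual IIIT interface.

import socket

# A hypothetical wrapper around a Morpher-style server: convert UTF-8
# Devanagari to a WX-like romanization, query the server, and read back the
# analysis. NOTE: real WX transliteration also inserts inherent vowels;
# this toy table does not, and the host/port are placeholders.

DEV_TO_WX = {"\u0930": "r", "\u0939": "h", "\u093e": "A"}  # tiny demo table

def utf8_to_wx(text):
    return "".join(DEV_TO_WX.get(ch, ch) for ch in text)

def analyze(word, host="localhost", port=9999):
    with socket.create_connection((host, port)) as sock:
        sock.sendall((utf8_to_wx(word) + "\n").encode("ascii"))
        return sock.makefile().readline().strip()  # e.g. root + features

# analyze("रहा")  # would return the Morpher output for raha (continue/stay/keep)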

4.1.2. Manually Written Grammar

A complete transfer grammar cannot be written in one month (the duration of the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and can be combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequence rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Figure 5: Recursive NP rules

Hindi NP with left recursion: jIvana (life) ke (of) eka (one) aXyAya (chapter)
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion: one chapter of life
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Figure 6: Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Figure 7: Some automatically learned rules

4.1.3. Automatically Learned Grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on translation performance, but a great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.

4.2. Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself: the translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be one-to-one, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words, leaving only those that were the best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense forms. The result is a set of lexical entries associating Hindi root forms with English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root is extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed. (A toy sketch of this pruning step is shown after the list below.)

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. These rules indicate, for instance, when to reduplicate a consonant, or when to add "es" instead of a simple "s" for the plural. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource of high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed to occur frequently in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of postposition words that generally correspond to English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
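The pruning-by-unification step described above can be illustrated with flat feature dictionaries; kitAba and its entries are hypothetical, and the real system used a full unification engine rather than this simplified check.

# A small sketch of lexical-rule pruning: the features returned by the
# morphological analyzer are checked against the constraints on each
# candidate lexical rule, and rules with clashing features are discarded.

def unify(word_feats, rule_constraints):
    # A constraint is satisfied if the analyzed word has the same value
    # (features absent from the analysis are treated as compatible).
    return all(word_feats.get(k, v) == v for k, v in rule_constraints.items())

candidates = [
    ("kitAba", "book", {"num": "sg"}),    # hypothetical Hindi noun entries
    ("kitAba", "books", {"num": "pl"}),
]
analyzed = {"root": "kitAba", "num": "pl"}  # Morpher output for a plural form

selected = [c for c in candidates if unify(analyzed, c[2])]
print(selected)   # only the plural English form survives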

4.3. Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details, see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full (inflected) form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

A second pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored, and low-probability entries are not chosen by the language model for the final translation.
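A schematic sketch of the three passes follows; phrasal_lexicon, morpher, grammar_parse, and word_lexicon are stand-ins for the real components.

# Lattice entries are (start, end, tl_string) spans over the input words.

def build_lattice(words, phrasal_lexicon, morpher, grammar_parse, word_lexicon):
    lattice = []
    # Pass 1: match full (inflected) forms against phrasal lexical entries
    # (more than one Hindi word); no grammar rules are applied.
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):
            for tl in phrasal_lexicon.get(tuple(words[i:j]), []):
                lattice.append((i, j, tl))
    # Pass 2: run each word through morphology, then feed root forms and
    # features to the transfer grammar.
    analyses = [morpher(w) for w in words]
    lattice.extend(grammar_parse(analyses))
    # Pass 3: match the original inflected words word-to-word, no grammar.
    for i, w in enumerate(words):
        for tl in word_lexicon.get(w, []):
            lattice.append((i, i + 1, tl))
    return lattice

demo = build_lattice(
    ["eka", "kitAba"],                               # hypothetical input
    phrasal_lexicon={("eka", "kitAba"): ["a book"]},
    morpher=lambda w: {"root": w},
    grammar_parse=lambda analyses: [],
    word_lexicon={"kitAba": ["book"]},
)
print(demo)   # [(0, 2, 'a book'), (1, 2, 'book')]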

5. THE LIMITED DATA SCENARIO EXPERIMENT

5.1. Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group, as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                 BLEU                NIST
EBMT                   0.058               4.22
SMT                    0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar        0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar   0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar    0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer + SMT)      0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

[Plot: NIST score as a function of the decoder reordering window (0-4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.]

Fig 9 Results by NIST score

[Plot: BLEU score as a function of the decoder reordering window (0-4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.]

Fig 10 Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set. The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding: at any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model. The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
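A minimal sketch of this bootstrap procedure follows; the function `score` stands in for a corpus-level metric such as BLEU or NIST computed over a list of (hypothesis, references) pairs, and is not the actual evaluation code:

    import random
    import statistics

    def bootstrap_ci(test_set, score, n_draws=10000, seed=0):
        """Median and 95% confidence interval via resampling with replacement."""
        rng = random.Random(seed)
        n = len(test_set)                      # 258 sentences in our experiment
        scores = sorted(score([test_set[rng.randrange(n)] for _ in range(n)])
                        for _ in range(n_draws))
        return (statistics.median(scores),
                scores[int(0.025 * n_draws)],  # lower bound of the 95% interval
                scores[int(0.975 * n_draws)])  # upper bound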

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over" point: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned

grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is not statistically very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and VPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India that conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. MS thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

In addition to Hindi, Hebrew, Mapudungun, and Quechua, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631 We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer

5. References

Allen, James. (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramatica Quechua. Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference. Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine translation systems for indigenous languages. Second Conference on the Indigenous Languages of Latin America (CILLA II). Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish user studies. International Conference on Language Resources and Evaluation (LREC). Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100. Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a transfer engine for rapid machine translation development. MS thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD Thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review 21(2):113-138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX workshop on Machine Translation for Semitic Languages, New Orleans, September.


System            BLEU                      NIST                      Precision   Recall
No Grammar        0.0606 [0.0599, 0.0612]   3.4176 [3.4080, 3.4272]   0.3830      0.4153
Learned Grammar   0.0775 [0.0768, 0.0782]   3.5397 [3.5296, 3.5498]   0.3938      0.4219
Manual Grammar    0.1013 [0.1004, 0.1021]   3.7850 [3.7733, 3.7966]   0.4085      0.4241

Table 1. System Performance Results with the Various Grammars

2.10 Results and Evaluation

The current system was developed over the course of a two-month period, and was targeted for translation of newspaper texts. Most of this time was devoted to the construction of the bilingual lexicon and stabilizing the front-end Hebrew processing in the system (morphology and input representation issues). Once the system was reasonably stable, we devoted about two weeks of time to improving the system based on a small development set of data. For development we used a set of 113 sentences from the Hebrew daily HaAretz. Average sentence length was approximately 15 words. Development consisted primarily of fixing incorrect mappings before and after morphological processing, and modifications to the bilingual lexicon. The small transfer grammar was also developed during this period. Given the limited resources and the limited development time, we find the results to be highly encouraging. For many of the development input sentences, translations are reasonably comprehensible. Figure 5 contains a few select translation examples from the development data.

To quantitatively evaluate the results achieved so far, and to comparatively assess the performance of our manual transfer grammar and our automatically acquired grammar, we tested the system on a set of 62 unseen sentences from HaAretz. Three versions of the system were tested on the same data set: a version using our manual transfer grammar, a version using our current automatically-learned grammar, and a version with no transfer grammar at all, which amounts to a word-to-word translation version of the system. Results were evaluated using several automatic metrics for MT evaluation, which compare the translations with human-produced reference translations for the test sentences. For this test set, two reference translations were obtained. We use the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic metrics for MT evaluation. We also include aggregate unigram-precision and unigram-recall as additional reported measures. The results can be seen in Table 1. To assess the statistical significance of the differences in performance between the three versions of the system, we applied a commonly used bootstrapping technique (Efron and Tibshirani, 1986) to estimate the variability over the test set and establish confidence intervals for each reported performance score. As expected, the manual grammar system outperforms the no-grammar system according to all the metrics. The learned grammar system scores fall between the scores for the other two systems, indicating that some appropriate structural transfer rules are in fact acquired. All results are statistically highly significant.

Abstract

We describe the rapid development of a preliminary Hebrew-to-English Machine Translation system, under a transfer-based framework specifically designed for rapid MT prototyping for languages with limited linguistic resources. The task is particularly challenging due to two main reasons: the high lexical and morphological ambiguity of Hebrew, and the dearth of available resources for the language. Existing publicly available resources were adapted in novel ways to support the MT task. The methodology behind the system combines two separate modules: a transfer engine, which produces a lattice of possible translation segments, and a decoder, which searches and selects the most likely translation according to an English language model. We demonstrate that a small manually crafted set of transfer rules suffices to produce legible translations. Performance results are evaluated using state-of-the-art measures and are shown to be encouraging.

2.11 Hindi

Little enough natural language processing work had been done with Hindi by 2003 for DARPA to designate Hindi as the language to build MT systems for, in a task designed to build MT systems for languages with fewer resources.

We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation, and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches, Statistical MT (SMT) and Example-based MT (EBMT), under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than the current automatically inferred rules. Furthermore, a multi-engine version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing; Machine Translation

General Terms: Evaluation, Hindi, Machine Translation

Additional Key Words and Phrases: Example-based Machine Translation, Limited Data Resources, Machine Learning, Multi-Engine Machine Translation, Statistical Translation, Transfer Rules

1 INTRODUCTION

Corpus-based Machine Translation (MT) approaches, such as Statistical Machine Translation (SMT) [Brown et al. 1990; Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001; Papineni et al. 1998; Och and Ney] and Example-based Machine Translation (EBMT) [Brown 1997; Sato and Nagao 1990], have received much attention in recent years


and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated, and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based MT System", or in short the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that Hindi-to-English was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE (cf. [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]). We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combined the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data

scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig 1 Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e. English) into their native language (i.e. Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number, agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001; Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from 'traditional' transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL).

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the 'right-hand sides' of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

    ;; PASSIVE SIMPLE PRESENT
    VP::VP : [V V Aux] -> [Aux "being" V]
    (
     (X1::Y3)
     ((x1 form) = part)
     ((x1 aspect) = perf)
     ((x2 form) = part)
     ((x2 aspect) = imperf)
     ((x2 lexwx) = jAnA)
     ((x3 lexwx) = honA)
     ((x3 tense) = pres)
     ((x0 tense) = (x3 tense))
     (x0 = x1)
     ((y1 lex) = be)
     ((y1 tense) = pres)
     ((y3 form) = part)
    )

Fig. 3. A transfer rule for present tense verb sequences in the passive voice.

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial 'guesses' at transfer rules. The rules that result from Seed Generation are 'flat', in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be re-written more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
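To make the rule components concrete, the following hypothetical sketch shows one way the rule of Figure 3 might be held in memory. The field names mirror the component list above rather than the engine's actual implementation, and the head-propagation constraint (x0 = x1) is omitted for brevity:

    from dataclasses import dataclass, field

    @dataclass
    class TransferRule:
        rule_type: str                      # e.g. "VP::VP"
        x_side: list                        # SL right-hand side, e.g. ["V", "V", "Aux"]
        y_side: list                        # TL right-hand side, e.g. ["Aux", "being", "V"]
        alignments: list                    # e.g. [(1, 3)] for X1::Y3
        x_constraints: list = field(default_factory=list)
        y_constraints: list = field(default_factory=list)
        xy_constraints: list = field(default_factory=list)

    # The rule of Figure 3, transcribed into this representation:
    passive_present = TransferRule(
        rule_type="VP::VP",
        x_side=["V", "V", "Aux"],
        y_side=["Aux", "being", "V"],
        alignments=[(1, 3)],
        x_constraints=[(("x1", "form"), "part"), (("x1", "aspect"), "perf"),
                       (("x2", "form"), "part"), (("x2", "aspect"), "imperf"),
                       (("x2", "lexwx"), "jAnA"), (("x3", "lexwx"), "honA"),
                       (("x3", "tense"), "pres")],
        y_constraints=[(("y1", "lex"), "be"), (("y1", "tense"), "pres"),
                       (("y3", "form"), "part")],
        xy_constraints=[(("x0", "tense"), ("x3", "tense"))],
    )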

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The run-time translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the run-time system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the run-time transfer-based translation sub-system can be found in [Peterson 2002].

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining, but non-overlapping, translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)
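For illustration, a toy maximum-likelihood estimator of such a trigram model might look as follows. The actual system used a model trained on 70 million words, presumably with smoothing; this unsmoothed sketch is only meant to make Equation (2) concrete:

    from collections import defaultdict

    def train_trigram(sentences):
        """Unsmoothed MLE trigram model; sentences are lists of words."""
        tri, bi = defaultdict(int), defaultdict(int)
        for sent in sentences:
            words = ["<s>", "<s>"] + sent + ["</s>"]
            for i in range(2, len(words)):
                tri[(words[i-2], words[i-1], words[i])] += 1
                bi[(words[i-2], words[i-1])] += 1
        # returns p(w | u, v), the probability of w given the two preceding words
        return lambda w, u, v: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0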

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_j \sum_i p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase, and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_s p(\tilde{f}_s \mid \tilde{e}_s) = \prod_s \prod_j \sum_i p(f_{s_j} \mid e_{s_i})    (4)

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of reordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word reorderings are scored using a Gaussian probability distribution, i.e., longer movements are less likely than shorter ones, with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search, i.e., partial translation hypotheses which are low-scoring compared to the best scoring hypothesis up to that point are pruned from further consideration.
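Putting Equations (1)-(4) together, a minimal sketch of the decoding search (without the reordering and beam pruning just described) might look as follows; `lexicon` (IBM-1 probabilities p(f|e)) and `trigram` (p(w|u,v), e.g. the train_trigram sketch above) are hypothetical stand-ins for the trained models:

    import math

    def phrase_prob(f_words, e_words, lexicon):
        """Equation (3): p(f~|e~) = prod_j sum_i p(f_j | e_i)."""
        p = 1.0
        for f in f_words:
            p *= sum(lexicon.get(f, {}).get(e, 1e-10) for e in e_words) or 1e-10
        return p

    def decode(f_words, lattice, lexicon, trigram):
        """Best sequence of adjoining, non-overlapping units covering f_words.

        `lattice` maps a start index to a list of (end, e_words) units;
        best[j] holds (log-prob, translation, last two English words) for
        the best path covering f_words[0:j].
        """
        J = len(f_words)
        best = {0: (0.0, [], ("<s>", "<s>"))}
        for i in range(J):
            if i not in best:
                continue
            logp, trans, (u, v) = best[i]
            for end, e_words in lattice.get(i, []):
                # translation model term, Equations (3)-(4)
                score = logp + math.log(phrase_prob(f_words[i:end], e_words, lexicon))
                # language model term, Equation (2)
                uu, vv = u, v
                for w in e_words:
                    score += math.log(trigram(w, uu, vv) or 1e-10)
                    uu, vv = vv, w
                if end not in best or score > best[end][0]:
                    best[end] = (score, trans + list(e_words), (uu, vv))
        return best.get(J)  # argmax of Equation (1) over full-coverage paths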

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences, as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way, we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.
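A rough sketch of such an extraction step follows, here using NLTK's bundled 10% Penn Treebank sample (WSJ) for illustration rather than the Brown Corpus section used in the project:

    import nltk
    from nltk.corpus import treebank   # requires nltk.download('treebank')

    def extract_phrases(labels=("NP", "PP")):
        """Collect NP/PP subtrees, sorted by parse-tree depth and shape."""
        phrases = []
        for tree in treebank.parsed_sents():
            for sub in tree.subtrees(lambda t: t.label() in labels):
                # daughter-node signature groups e.g. NPs containing only
                # nouns apart from NPs with determiners, adjectives, etc.
                signature = " ".join(
                    child.label() if isinstance(child, nltk.Tree) else child
                    for child in sub)
                phrases.append((sub.height(), signature, " ".join(sub.leaves())))
        phrases.sort()
        return phrases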

trained Xfer system performs translation in the opposite direction This has both advan-tages and disadvantages The main advan-tage was our ability to rely on a extensive re-sources available for English such as tree-banks The main disadvantage was that typing in Hindi was not very natural even for the na-tive speakers of the language resulting in some level of typing errors This however did not pose a major problem because the ex-tracted rules are mostly generalized to the part-of-speech level Furthermore since the runtime translation direction is from Hindi to English rules that include incorrect Hindi spelling will not match during translation but will not cause incorrect translation
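To make the extraction-and-sorting step concrete, the following sketch pulls NP and PP subtrees out of bracketed parses and orders them by parse-tree depth, simplest first. The tiny tree reader and the toy parses are illustrative stand-ins for the actual Penn Treebank tooling used during the SLE:

def parse_tree(s):
    """Read one bracketed tree, e.g. '(NP (DT the) (NN cat))', into a
    (label, children) pair; a bare word becomes (word, [])."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0
    def read():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(read())
            else:
                children.append((tokens[pos], []))
                pos += 1
        pos += 1
        return (label, children)
    return read()

def depth(node):
    label, children = node
    return 1 + max((depth(c) for c in children), default=0)

def leaves(node):
    label, children = node
    return [label] if not children else [w for c in children for w in leaves(c)]

def subtrees(node, wanted=('NP', 'PP')):
    label, children = node
    if label in wanted:
        yield node
    for c in children:
        yield from subtrees(c, wanted)

trees = [parse_tree('(S (NP (DT the) (NN letter)) (VP (VBD arrived)))'),
         parse_tree('(NP (NP (DT a) (NN chapter)) (PP (IN of) (NP (NN life))))')]
phrases = sorted((p for t in trees for p in subtrees(t)), key=depth)
for p in phrases:                          # files of ~200 phrases would be
    print(depth(p), ' '.join(leaves(p)))   # chunked from this sorted list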

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description            Morpher          Xfer
past participle        (tam = yA)       (aspect = perf) (form = part)
present participle     (tam = wA)       (aspect = imperf) (form = part)
infinitive             (tam = nA)       (form = inf)
future                 (tam = future)   (tense = fut)
subjunctive            (tam = subj)     (tense = subj)
root                   (tam = 0)        (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
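The encoding wrapper might look roughly as follows. The character table is a tiny illustrative fragment of the WX scheme (a real converter also handles the inherent vowel and conjuncts), and query_morpher is a placeholder for the actual round trip to the analyzer server:

DEVANAGARI_TO_WX = {'क': 'k', 'ख': 'K', 'ग': 'g', 'र': 'r',
                    'ह': 'h', 'ा': 'A', 'ि': 'i'}   # fragment only
WX_TO_DEVANAGARI = {v: k for k, v in DEVANAGARI_TO_WX.items()}

def utf8_to_wx(text):
    return ''.join(DEVANAGARI_TO_WX.get(ch, ch) for ch in text)

def wx_to_utf8(text):
    return ''.join(WX_TO_DEVANAGARI.get(ch, ch) for ch in text)

def query_morpher(wx_word):
    """Placeholder for the analyzer server: returns a WX root + features."""
    return wx_word, {'gen': 'm', 'num': 'sg'}

def analyze(utf8_word):
    """UTF-8 in, UTF-8 out; WX is confined to this wrapper."""
    wx_root, feats = query_morpher(utf8_to_wx(utf8_word))
    return wx_to_utf8(wx_root), feats

print(analyze('रहा'))   # naive character-level round trip, for illustration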

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequences that are covered by the rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion:
(jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion:
(one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2 and in greater detail in [Probst et al 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
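A minimal sketch of the frequency-based pruning described above, assuming each learned rule carries a count of the training examples that produced it; the rule strings, counts, and threshold are all illustrative:

from collections import Counter, defaultdict

rule_counts = Counter({
    ('NP', '[ADJ N] -> [ADJ N]'): 140,
    ('NP', '[N ADJ] -> [ADJ N]'): 3,
    ('PP', '[NP POSTP] -> [PREP NP]'): 95,
    ('PP', '[POSTP NP] -> [PREP NP]'): 2,
})

def prune(rule_counts, threshold=0.05):
    totals = defaultdict(int)
    for (lhs, _), c in rule_counts.items():
        totals[lhs] += c
    # keep a rule only if its relative frequency among rules with the
    # same left-hand side reaches the threshold
    return {(lhs, rhs): c / totals[lhs]
            for (lhs, rhs), c in rule_counts.items()
            if c / totals[lhs] >= threshold}

for (lhs, rhs), p in sorted(prune(rule_counts).items()):
    print(f'{lhs}: {rhs}   p={p:.3f}')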

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words and leaving only those that were 'best bets' for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text) and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms such as plural nouns, we 'enhanced' the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for plural, etc. For irregular verb inflections we consulted a list of English irregular verbs (a sketch of such spelling rules appears after the list below). With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
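As promised above, here is a minimal sketch of the kind of spelling-rule-based English inflection used to enhance the dictionary. Only a few illustrative rules are shown; they over-apply to some words, the real system also consulted a list of irregular verbs, and the Hindi romanizations are invented examples:

VOWELS = set('aeiou')

def pluralize(noun):
    if noun.endswith(('s', 'x', 'z', 'ch', 'sh')):
        return noun + 'es'                       # church -> churches
    if noun.endswith('y') and noun[-2] not in VOWELS:
        return noun[:-1] + 'ies'                 # city -> cities
    return noun + 's'                            # letter -> letters

def past_tense(verb):
    if verb.endswith('e'):
        return verb + 'd'                        # move -> moved
    if (len(verb) >= 3 and verb[-1] not in VOWELS and verb[-1] not in 'wxy'
            and verb[-2] in VOWELS and verb[-3] not in VOWELS):
        return verb + verb[-1] + 'ed'            # stop -> stopped
    return verb + 'ed'

# Each generated form becomes an extra lexical entry with a feature
# constraint, e.g. ((y num) = pl) for the plural-noun entries.
print('patr -> letter, +pl:', pluralize('letter'))
print('stop ->', past_tense('stop'), '; move ->', past_tense('move'))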

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in their full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
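The three passes might be sketched as follows; the lexicons and the analyzer stub are toy stand-ins for the components described in Sections 4.1.1 and 4.2, and the grammar-rule application of the second pass is elided:

phrasal_lexicon = {('bheje', 'jAte', 'h~ai'): 'are being sent'}
root_lexicon    = {'patr': 'letter', 'bhej': 'send'}
surface_lexicon = {'patr': 'letter'}

def analyze(word):                      # stand-in for the IIIT Morpher
    return {'bheje': 'bhej'}.get(word, word)

def populate_lattice(sentence):
    lattice = []                        # (start, end, translation, pass)
    n = len(sentence)
    # Pass 1: phrasal entries over full inflected forms; no grammar rules.
    for i in range(n):
        for j in range(i + 2, n + 1):
            phrase = tuple(sentence[i:j])
            if phrase in phrasal_lexicon:
                lattice.append((i, j, phrasal_lexicon[phrase], 'phrasal'))
    # Pass 2: morphological roots; in the real system these entries also
    # feed the transfer grammar rules (elided here).
    for i, word in enumerate(sentence):
        root = analyze(word)
        if root in root_lexicon:
            lattice.append((i, i + 1, root_lexicon[root], 'root'))
    # Pass 3: word-to-word matches on the original inflected forms.
    for i, word in enumerate(sentence):
        if word in surface_lexicon:
            lattice.append((i, i + 1, surface_lexicon[word], 'surface'))
    return lattice

for entry in populate_lattice(['patr', 'dAk', 'se', 'bheje', 'jAte', 'h~ai']):
    print(entry)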

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                  BLEU                NIST
EBMT                    0.058               4.22
SMT                     0.102 (± 0.016)     4.70 (± 0.20)
Xfer no grammar         0.109 (± 0.015)     5.29 (± 0.19)
Xfer learned grammar    0.112 (± 0.016)     5.32 (± 0.19)
Xfer manual grammar     0.135 (± 0.018)     5.59 (± 0.20)
MEMT (Xfer + SMT)       0.136 (± 0.018)     5.65 (± 0.21)

Table II. System Performance Results for the Various Translation Approaches

[Figure: NIST scores (4.2 to 6.0) as a function of the reordering window (0 to 4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig. 9. Results by NIST score

[Figure: BLEU scores (0.07 to 0.15) as a function of the reordering window (0 to 4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]

Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the Bleu score [Papineni et al 2002]. In order to validate the statistical significance of the differences in NIST and Bleu scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
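The resampling procedure is straightforward to sketch; the score function below is a toy stand-in for the corpus-level BLEU/NIST scorers actually used:

import random

def score(sentence_ids):
    """Stand-in corpus-level metric over a multiset of test sentences."""
    return sum(0.01 * (i % 7) for i in sentence_ids) / len(sentence_ids)

def bootstrap(n_sentences=258, replicates=10000, seed=0):
    rng = random.Random(seed)
    scores = sorted(
        score([rng.randrange(n_sentences) for _ in range(n_sentences)])
        for _ in range(replicates))
    return (scores[len(scores) // 2],            # median
            scores[int(0.025 * len(scores))],    # lower 95% bound
            scores[int(0.975 * len(scores))])    # upper 95% bound

median, lo, hi = bootstrap()
print(f'median={median:.4f}  95% CI=[{lo:.4f}, {hi:.4f}]')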

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for Bleu), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over" point: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India that conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3 Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon as well as improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translation produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

In the conclusion for the paper contrasting Hindi, Hebrew, Mapudungun, and Quechua, we should list the other languages we have worked on and will be working on:

Worked on: Dutch.
Beginning work on: Inupiaq, Brazilian Portuguese, languages native to Brazil, Aymara.

4 Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5 References

Allen, James (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio (2001). Gramatica Quechua. Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich and Lori Levin (2005b). Building Machine translation systems for indigenous languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish user studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pages 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef and Carolina Huenchullan (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP) 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review 21(2):113-138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX workshop on Machine Translation for Semitic Languages, New Orleans, September.


and have significantly improved the state-of-the-art of Machine Translation for a number of different language pairs. These approaches are attractive because they are fully automated and require orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, the corpus-based methods require very large volumes of sentence-aligned parallel text for the two languages, on the order of magnitude of a million words or more. Such resources are currently available for only a small number of language pairs. While the amount of online resources for many languages will undoubtedly grow over time, many of the languages spoken by smaller ethnic groups and populations in the world will not have such resources within the foreseeable future. Corpus-based MT approaches will therefore not be effective for such languages for some time to come.

Our MT research group at Carnegie Mellon, under DARPA and NSF funding, has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. We refer to this system as the "Trainable Transfer-based" MT System, or in short the Xfer system.

The DARPA-sponsored Surprise Language Exercise (SLE) of June 2003 provided us with a golden opportunity to test out the capabilities of our approach. The Hindi-to-English system that we developed in the course of this exercise was the first large-scale, open-domain test for our system. Our goal was to compare the performance of our Xfer system with the corpus-based SMT and EBMT approaches developed both within our group at Carnegie Mellon and by our colleagues elsewhere. Common training and testing data were used for the purpose of this cross-system comparison. The data was collected throughout the SLE during the month of June 2003. As data resources became available, it became clear that the Hindi-to-English task was in fact not a "limited-data" situation. By the end of the SLE, over 1.5 million words of parallel Hindi-English text had been collected, which was sufficient for development of basic-quality SMT and EBMT systems. In a common evaluation conducted at the end of the SLE, the SMT systems that participated in the evaluation outperformed our Xfer system, as measured by the NIST automatic MT evaluation metric [Doddington 2003]. Our system received a NIST score of 5.47, as compared with the best SMT system, which received a score of 7.61.

Our intention, however, was to test our Xfer system under a far more limited data scenario than the one that had developed by the end of the SLE [Nirenburg 1998; Sherematyeva and Nirenburg 2000; Jones and Havrilla 1998]. We therefore designed an "artificial" extremely limited data scenario, where we limited the amount of available training data to about 50 thousand words of word-aligned parallel text that we had collected during the SLE. We then designed a controlled experiment in order to compare our Xfer system with our in-house SMT and EBMT systems under this limited data scenario. The design, execution, and results of this experiment are the focus of this paper. The results of the experiment indicate that, under these extremely limited training data conditions, when tested on unseen data, the Xfer system significantly outperforms both EBMT and SMT.

Several different versions of the Xfer system were tested. Results indicated that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of our system. System performance with a limited number of manually written transfer rules was, however, still better than the current automatically inferred rules. Furthermore, a "multi-engine" version of our system, which combined the output of the Xfer and SMT systems and optimized translation selection, outperformed both individual systems.

The remainder of this paper is organized as follows: Section 2 presents an overview of the Xfer system and its components. Section 3 describes the elicited data collection for Hindi-English that we conducted during the SLE, which provided the bulk of training data for our limited data experiment. Section 4 describes the specific resources and components that were incorporated into our Hindi-to-English Xfer system. Section 5 then describes the controlled experiment for comparing the Xfer, EBMT, and SMT systems under the limited data scenario, and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1. Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data, that such data can be elicited from non-expert bilingual speakers of the pair of languages, and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish) for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snapshot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number, agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al 2001], [Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from 'traditional' transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL):

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the 'right-hand sides' of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

{PASSIVE SIMPLE PRESENT}
VP::VP [V V Aux] -> [Aux being V]
((X1::Y3)
((x1 form) = part)
((x1 aspect) = perf)
((x2 form) = part)
((x2 aspect) = imperf)
((x2 lexwx) = jAnA)
((x3 lexwx) = honA)
((x3 tense) = pres)
((x0 tense) = (x3 tense))
(x0 = x1)
((y1 lex) = be)
((y1 tense) = pres)
((y3 form) = part))

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.
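A minimal sketch of how a rule with these components might be represented programmatically; the field names are illustrative, and the instance transcribes the passive-voice rule of Figure 3 above:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TransferRule:
    rule_type: str                        # e.g. 'S', 'NP', 'VP'
    x_side: List[str]                     # SL right-hand side
    y_side: List[str]                     # TL right-hand side
    alignments: List[Tuple[int, int]]     # (x index, y index), 1-based
    x_constraints: List[str] = field(default_factory=list)
    y_constraints: List[str] = field(default_factory=list)
    xy_constraints: List[str] = field(default_factory=list)

passive_present = TransferRule(
    rule_type='VP',
    x_side=['V', 'V', 'Aux'],
    y_side=['Aux', 'being', 'V'],
    alignments=[(1, 3)],
    x_constraints=['(x1 form) = part', '(x1 aspect) = perf',
                   '(x2 form) = part', '(x2 aspect) = imperf',
                   '(x2 lexwx) = jAnA', '(x3 lexwx) = honA',
                   '(x3 tense) = pres'],
    y_constraints=['(y1 lex) = be', '(y1 tense) = pres', '(y3 form) = part'],
    xy_constraints=['(x0 tense) = (x3 tense)', 'x0 = x1'],
)
print(passive_present.rule_type, passive_present.x_side,
      '->', passive_present.y_side)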

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent", literally: sent going present-tense-auxiliary) in a sentence such as Ab tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail", literally: up-to-now letters mail by sent going present-tense-auxiliary). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1). The second element in the English verb sequence is being, whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial 'guesses' at transfer rules. The rules that result from Seed Generation are 'flat' in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as Det N P and Det Adj N P can be rewritten more generally as NP P, as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al 2003].
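The constraint-relaxation idea can be sketched as a greedy loop over constraints, assuming a checker that re-translates the elicited examples; the checker below is a toy stand-in with a hard-wired outcome:

def translates_correctly(constraints, example):
    # Stand-in for applying the rule under `constraints` to one elicited
    # example: here number agreement is genuinely needed, gender is not.
    return any('num' in c for c in constraints)

def relax(constraints, examples):
    current = list(constraints)
    for c in list(current):
        trial = [k for k in current if k != c]
        if all(translates_correctly(trial, ex) for ex in examples):
            current = trial               # constraint was superfluous
    return current

seed = ['(x1 num) = (x2 num)', '(x1 gen) = (x2 gen)']
examples = ['elicited example 1', 'elicited example 2']
print(relax(seed, examples))              # the gen constraint is relaxed away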

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for:

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)   (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})   (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_j \sum_i p(f_j \mid e_i)   (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then:

p(f \mid e) = \prod_s p(\tilde{f}_s \mid \tilde{e}_s) = \prod_s \prod_j \sum_i p(f_{s_j} \mid e_{s_i})   (4)
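The following Python fragment is a direct transcription of equations (2)-(4), under the assumption that the trained trigram language model and IBM-1 lexicon are available as plain dictionaries; the toy probability values at the end are illustrative only.

import math
from collections import defaultdict

def trigram_logprob(e_words, lm, floor=1e-7):
    # Equation (2): p(e) = prod_i p(e_i | e_{i-2}, e_{i-1})
    padded = ["<s>", "<s>"] + e_words
    return sum(math.log(lm.get(tuple(padded[i - 2:i + 1]), floor))
               for i in range(2, len(padded)))

def phrase_logprob(f_phrase, e_phrase, lex, floor=1e-7):
    # Equation (3): p(f~|e~) = prod_j sum_i p(f_j | e_i)
    return sum(math.log(sum(lex[(f, e)] for e in e_phrase) or floor)
               for f in f_phrase)

def translation_logprob(units, lex):
    # Equation (4): product over a chosen sequence of translation units
    return sum(phrase_logprob(f, e, lex) for f, e in units)

lex = defaultdict(float, {("bheje", "sent"): 0.7, ("patr", "letters"): 0.6})
units = [(["patr"], ["letters"]), (["bheje"], ["sent"])]
print(translation_logprob(units, lex))  # log(0.6) + log(0.7)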

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability. As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones), with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring compared to the best scoring hypothesis up to that point are pruned from further consideration.
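A toy, monotone version of this lattice search is sketched below. The lattice entry format is an assumption, the language model term of equation (1) is omitted for brevity, and the gap-and-fill reordering is represented only by its Gaussian scoring function.

# Toy beam search over a translation lattice. Each entry is
# (start, end, tl_words, logprob) over source word indices; these
# names are assumptions for the sketch, not the real decoder's API.

VARIANCE = 2.0   # in the real decoder, tuned on a development set
BEAM = 5

def reordering_logscore(jump):
    # Zero-mean Gaussian over jump distance: longer moves less likely.
    return -0.5 * (jump ** 2) / VARIANCE

def decode(lattice, sent_len):
    # hyps[pos] = list of (score, tl_words) hypotheses covering [0, pos)
    hyps = {0: [(0.0, [])]}
    for pos in range(sent_len):
        beam = sorted(hyps.get(pos, []), reverse=True)[:BEAM]
        for score, words in beam:
            for start, end, tl, lp in lattice:
                if start == pos:   # adjoining, non-overlapping unit
                    hyps.setdefault(end, []).append((score + lp, words + tl))
    return max(hyps.get(sent_len, [(float("-inf"), ["<no path>"])]))

lattice = [(0, 1, ["letters"], -0.4),
           (1, 3, ["are", "being", "sent"], -0.9),
           (0, 3, ["letters", "sent"], -2.5)]
print(decode(lattice, 3))        # (-1.3, ['letters', 'are', 'being', 'sent'])
print(reordering_logscore(2))    # penalty a unit would pay for a 2-word move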

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Brown Corpus] that we extracted from the Penn Treebank [Penn Treebank].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees, so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases, based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will not match during translation, but will not cause incorrect translation.
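The Treebank-based phrase extraction described in this section can be approximated in a few lines of NLTK. The sketch below runs on the small Penn Treebank sample shipped with NLTK (nltk.download('treebank')), rather than the full Treebank parses of the Brown Corpus used in our collection.

from nltk.corpus import treebank

# Pull NP subtrees out of parsed sentences and sort them by parse-tree
# depth, so simple phrases precede complex ones, as in the elicitation
# pipeline described above.

def extract_phrases(label="NP"):
    phrases = []
    for tree in treebank.parsed_sents():
        for sub in tree.subtrees(lambda t: t.label() == label):
            phrases.append((sub.height(), " ".join(sub.leaves())))
    phrases.sort(key=lambda p: p[0])   # complexity = tree depth
    return phrases

for depth, phrase in extract_phrases()[:5]:
    print(depth, phrase)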

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description        | Morpher        | Xfer
past participle    | (tam = yA)     | (aspect = perf) (form = part)
present participle | (tam = wA)     | (aspect = imperf) (form = part)
infinitive         | (tam = nA)     | (form = inf)
future             | (tam = future) | (tense = fut)
subjunctive        | (tam = subj)   | (tense = subj)
root               | (tam = 0)      | (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [IIIT Morpher]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed. Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
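The following sketch suggests what such an input-output wrapper can look like. The character correspondences and the server interface are placeholders invented for the example; only the feature mapping is taken from Table I.

# Sketch of the encoding wrapper around the Morpher server. The
# UTF-8 <-> Roman-WX letter table and the server callable are
# illustrative stand-ins, not the actual IIIT Morpher interface.

UTF8_TO_WX = {"र": "r", "ह": "h", "ा": "A"}   # assumed partial mapping

# Table I, as a mapping from Morpher features to Xfer features.
FEATURE_MAP = {
    "(tam = yA)": "(aspect = perf) (form = part)",
    "(tam = wA)": "(aspect = imperf) (form = part)",
    "(tam = nA)": "(form = inf)",
    "(tam = future)": "(tense = fut)",
    "(tam = subj)": "(tense = subj)",
    "(tam = 0)": "(form = root)",
}

def utf8_to_wx(word):
    return "".join(UTF8_TO_WX.get(ch, ch) for ch in word)

def analyze(word, morpher_server):
    """Convert UTF-8 input to Roman-WX, query the (assumed) Morpher
    server, and rewrite its features into the Xfer feature system."""
    raw = morpher_server(utf8_to_wx(word))
    for morpher_feat, xfer_feat in FEATURE_MAP.items():
        raw = raw.replace(morpher_feat, xfer_feat)
    return raw

# Toy stand-in for the server, returning a root plus features:
print(analyze("रहा", lambda w: "rah (tam = yA)"))
# -> 'rah (aspect = perf) (form = part)'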

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequence rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion:
(jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion:
(one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules
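The effect of the {NP,12} and {PP,12} rules of Figure 5 can be reproduced with a small recursive tree rewrite. The tuple-based tree encoding and the four-word lexicon below are assumptions for the example; the real system applies these rules within the chart-based transfer engine of Section 2.3.

# Sketch of how the Figure 5 rules flip a left-branching Hindi NP
# (Figure 6) into a right-branching English NP. Trees are nested
# tuples (label, children...).

LEX = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}

def flip(tree):
    """Apply {NP,12} ([PP NP1] -> [NP1 PP]) and {PP,12}
    ([NP Postp] -> [Prep NP]) bottom-up."""
    if isinstance(tree, str):
        return LEX.get(tree, tree)
    label, *children = tree
    children = [flip(c) for c in children]
    if label == "np" and len(children) == 2 and children[0][0] == "pp":
        children = [children[1], children[0]]      # rule {NP,12}
    if label == "pp" and len(children) == 2:
        children = [children[1], children[0]]      # rule {PP,12}
    return (label, *children)

hindi_np = ("np",
            ("pp", ("np", ("nbar", ("n", "jIvana"))), ("p", "ke")),
            ("nbar", ("adj", "eka"), ("n", "aXyAya")))
print(flip(hindi_np))
# ('np', ('nbar', ('adj', 'one'), ('n', 'chapter')),
#        ('pp', ('p', 'of'), ('np', ('nbar', ('n', 'life')))))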

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon, by editing the list of English translations provided for the Hindi words and leaving only those that were the best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we 'enhanced' the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for the past tense, past participle, and gerund (and continuous), as well as a future tense entry. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for the plural, etc. For irregular verb inflections we consulted a list of English irregular verbs. (A sketch of this enhancement appears after the list of manually written rules below.) With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.
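Returning to the dictionary enhancement of Section 4.2, the sketch below shows spelling-rule pluralization and the generation of constrained plural entries. The entry format and constraint notation are simplified stand-ins; verb inflections and the consonant reduplication rule are omitted.

import re

# Sketch of the lexicon enhancement: for each noun entry, add a
# plural English form plus a plural-number constraint, using simple
# spelling rules. The (hindi, english, pos) triple format is assumed.

def pluralize(noun):
    """Spelling-rule plural: 'es' after sibilants, 'y' -> 'ies', else 's'."""
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"
    return noun + "s"

def enhance(entries):
    out = []
    for hi, en, pos in entries:
        out.append((hi, en, pos, []))
        if pos == "N":
            # Hindi root unchanged; English pluralized; constraint added.
            out.append((hi, pluralize(en), pos, ["((y1 NUM) = pl)"]))
    return out

print(enhance([("patr", "letter", "N")]))
# [('patr', 'letter', 'N', []),
#  ('patr', 'letters', 'N', ['((y1 NUM) = pl)'])]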

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2. Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
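A compact sketch of the three passes is given below, with stand-in callables for the morphology module and the grammar; the lattice entry format (start, end, translation) is an assumption for the example.

def build_lattice(sentence, phrasal_lexicon, word_lexicon, morph, grammar):
    words = sentence.split()
    lattice = []
    # Pass 1: match full inflected forms against phrasal entries
    # (2+ words); no grammar rules are applied in this pass.
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):
            phrase = " ".join(words[i:j])
            if phrase in phrasal_lexicon:
                lattice.append((i, j, phrasal_lexicon[phrase]))
    # Pass 2: morphological analysis feeds root forms to the grammar,
    # exploiting the lexicon enhancements of Section 4.2.
    roots = [morph(w) for w in words]
    lattice.extend(grammar(roots))
    # Pass 3: word-to-word translation of the original inflected
    # forms, without grammar rules.
    for i, w in enumerate(words):
        if w in word_lexicon:
            lattice.append((i, i + 1, word_lexicon[w]))
    return lattice

print(build_lattice("bheje jAte h~ai",
                    {"bheje jAte h~ai": "are being sent"},
                    {"bheje": "sent"},
                    lambda w: w, lambda roots: []))
# [(0, 3, 'are being sent'), (0, 1, 'sent')]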

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test-set for this experiment. The test-set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                | BLEU              | NIST
EBMT                  | 0.058             | 4.22
SMT                   | 0.102 (+/- 0.016) | 4.70 (+/- 0.20)
Xfer, no grammar      | 0.109 (+/- 0.015) | 5.29 (+/- 0.19)
Xfer, learned grammar | 0.112 (+/- 0.016) | 5.32 (+/- 0.19)
Xfer, manual grammar  | 0.135 (+/- 0.018) | 5.59 (+/- 0.20)
MEMT (Xfer + SMT)     | 0.136 (+/- 0.018) | 5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

[Figure: NIST score as a function of the decoder reordering window (0-4), for MEMT (SMT+Xfer), Xfer-ManGram, Xfer-LearnGram, Xfer-NoGram, and SMT]

Fig. 9. Results by NIST score

[Figure: BLEU score as a function of the decoder reordering window (0-4), for the same five systems]

Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
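For concreteness, this resampling scheme can be written in a few lines of Python; score_fn below stands in for BLEU or NIST computed over a drawn sample of (hypothesis, references) pairs, and the toy per-sentence scores are illustrative only.

import random
import statistics

def bootstrap(test_set, score_fn, draws=10000, seed=0):
    """Draw the test set with replacement, rescore, and repeat;
    report the median and the 95% confidence interval."""
    rng = random.Random(seed)
    scores = []
    for _ in range(draws):
        sample = [rng.choice(test_set) for _ in test_set]
        scores.append(score_fn(sample))
    scores.sort()
    lo, hi = scores[int(0.025 * draws)], scores[int(0.975 * draws)]
    return statistics.median(scores), (lo, hi)

# Toy usage, averaging per-sentence scores as the "metric":
sents = [0.1, 0.0, 0.3, 0.2, 0.1, 0.4]
print(bootstrap(sents, lambda s: sum(s) / len(s), draws=1000))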

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different reordering windows in the decoder on the different systems. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor; a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is not statistically very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated, rule-based approach to MT, using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon and improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the Hindi, Hebrew, Mapudungun, and Quechua systems contrasted in this paper, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese and other languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments (Hebrew-to-English system)

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio (2001). Gramatica Quechua. Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin (2005b). Building Machine translation systems for indigenous languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish user studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD Thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.


scenario and the results of this experiment. Finally, Section 6 describes our conclusions and future research directions.

Fig. 1. Architecture of the Xfer MT System and its Major Components

2 TRAINABLE TRANSFER-BASED MT OVERVIEW

The fundamental principles behind the design of our Xfer approach for MT are: that it is possible to automatically learn syntactic transfer rules from limited amounts of word-aligned data; that such data can be elicited from non-expert bilingual speakers of the pair of languages; and that the rules learned are useful for machine translation between the two languages. We assume that one of the two languages involved is a "major" language (such as English or Spanish), for which significant amounts of linguistic resources and knowledge are available.

The Xfer system consists of four main sub-systems: elicitation of a word-aligned parallel corpus; automatic learning of transfer rules; the run-time transfer system; and a statistical decoder for selection of a final translation output from a large lattice of alternative translation fragments produced by the transfer system. Figure 1 shows how the four sub-systems are used in a configuration in which the translation is from a limited-resource source language into a major target language, such as English.

2.1 Elicitation of Word-Aligned Parallel Data

The purpose of the elicitation sub-system is to collect a high quality, word-aligned parallel corpus. A specially designed user interface was developed to allow bilingual speakers to easily translate sentences from a corpus of the major language (i.e., English) into their native language (i.e., Hindi), and to graphically annotate the word alignments between the two sentences. Figure 2 contains a snap-shot of the elicitation tool, as used in the translation and alignment of an English sentence into Hindi. The informant must be bilingual and literate in the language of elicitation and the language being elicited, but does not need to have knowledge of linguistics or computational linguistics.

The word-aligned elicited corpus is the primary source of data from which transfer rules are inferred by our system. In order to support effective rule learning, we designed a "controlled" English elicitation corpus. The design of this corpus was based on elicitation principles from field linguistics, and the variety of phrases and sentences attempts to cover a wide variety of linguistic phenomena that the minor language may or may not possess. The elicitation process is organized along "minimal pairs", which allows us to identify whether the minor language possesses specific linguistic phenomena (such as gender, number, agreement, etc.). The sentences in the corpus are ordered in groups corresponding to constituent types of increasing levels of complexity. The ordering supports the goal of learning compositional syntactic transfer rules. For example, simple noun phrases are elicited before prepositional phrases and simple sentences, so that during rule learning the system can detect cases where transfer rules for NPs can serve as components within higher-level transfer rules for PPs and sentence structures. The current controlled elicitation corpus contains about 2,000 sentences. It is by design very limited in vocabulary. A more detailed description of the controlled elicitation corpus, the elicitation process, and the interface tool used for elicitation can be found in [Probst et al. 2001], [Probst and Levin 2002].

2.2 Automatic Transfer Rule Learning

The rule learning system takes the elicited, word-aligned data as input. Based on this information, it then infers syntactic transfer rules. The learning system also learns the composition of transfer rules. In the compositionality learning stage, the learning system identifies cases where transfer rules for "lower-level" constituents (such as NPs) can serve as components within "higher-level" transfer rules (such as PPs and sentence structures). This process generalizes the applicability of the learned transfer rules and captures the compositional makeup of syntactic correspondences between the two languages. The output of the rule learning system is a set of transfer rules that then serve as a transfer grammar in the run-time system. The transfer rules are comprehensive in the sense that they include all information that is necessary for parsing, transfer, and generation. In this regard, they differ from 'traditional' transfer rules that exclude parsing and generation information. Despite this difference, we will refer to them as transfer rules.

The design of the transfer rule formalism itself was guided by the consideration that the rules must be simple enough to be learned by an automatic process, but also powerful enough to allow manually-crafted rule additions and changes to improve the automatically learned rules.


The following list summarizes the components of a transfer rule. In general, the x-side of a transfer rule refers to the source language (SL), whereas the y-side refers to the target language (TL):

- Type information: This identifies the type of the transfer rule and, in most cases, corresponds to a syntactic constituent type. Sentence rules are of type S, noun phrase rules of type NP, etc. The formalism also allows for SL and TL type information to be different.

- Part-of-speech/constituent information: For both SL and TL, we list a linear sequence of components that constitute an instance of the rule type. These can be viewed as the 'right-hand sides' of context-free grammar rules for both source and target language grammars. The elements of the list can be lexical categories, lexical items, and/or phrasal categories.

- Alignments: Explicit annotations in the rule describe how the set of source language components in the rule align and transfer to the set of target language components. Zero alignments and many-to-many alignments are allowed.

- X-side constraints: The x-side constraints provide information about features and their values in the source language sentence. These constraints are used at run-time to determine whether a transfer rule applies to a given input sentence.

; PASSIVE, SIMPLE PRESENT
VP::VP : [V V Aux] -> [Aux being V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

- Y-side constraints: The y-side constraints are similar in concept to the x-side constraints, but they pertain to the target language. At run-time, y-side constraints serve to guide and constrain the generation of the target language sentence.

- XY-constraints: The xy-constraints provide information about which feature values transfer from the source into the target language. Specific TL words can obtain feature values from the source language sentence.
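Gathering the components above into one structure, the following sketch encodes the Figure 3 rule; the field names and the string form of the constraints are illustrative, not the Xfer system's internal representation.

from dataclasses import dataclass, field

# A minimal data-structure sketch of the transfer rule components
# listed above, populated with the Figure 3 passive-voice rule.

@dataclass
class TransferRule:
    rule_type: tuple           # (SL type, TL type), e.g. ("VP", "VP")
    x_side: list               # SL component sequence
    y_side: list               # TL component sequence
    alignments: list           # (x_index, y_index) pairs; may be empty or m-to-n
    x_constraints: list = field(default_factory=list)
    y_constraints: list = field(default_factory=list)
    xy_constraints: list = field(default_factory=list)

passive_present = TransferRule(
    rule_type=("VP", "VP"),
    x_side=["V", "V", "Aux"],
    y_side=["Aux", "being", "V"],
    alignments=[(1, 3)],
    x_constraints=["((x1 form) = part)", "((x1 aspect) = perf)",
                   "((x2 form) = part)", "((x2 aspect) = imperf)",
                   "((x2 lexwx) = jAnA)", "((x3 lexwx) = honA)",
                   "((x3 tense) = pres)"],
    y_constraints=["((y1 lex) = be)", "((y1 tense) = pres)",
                   "((y3 form) = part)"],
    xy_constraints=["((x0 tense) = (x3 tense))"],
)
print(passive_present.alignments)  # [(1, 3)]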

PASSIVE SIMPLE PRESENT
VP::VP [V V Aux] -> [Aux being V]
((X1::Y3)
 ((x1 form) = part)
 ((x1 aspect) = perf)
 ((x2 form) = part)
 ((x2 aspect) = imperf)
 ((x2 lexwx) = jAnA)
 ((x3 lexwx) = honA)
 ((x3 tense) = pres)
 ((x0 tense) = (x3 tense))
 (x0 = x1)
 ((y1 lex) = be)
 ((y1 tense) = pres)
 ((y3 form) = part))

Fig. 3. A transfer rule for present tense verb sequences in the passive voice

For illustration purposes, Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice. This rule would be used in translating the verb sequence bheje jAte h~ai ("are being sent"; literally, "sent going present-tense-auxiliary") in a sentence such as Ab-tak patr dAk se bheje jAte h~ai ("Letters still are being sent by mail"; literally, "up-to-now letters mail by sent going present-tense-auxiliary"). The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1), the passive auxiliary (jAnA, "go") inflected as an imperfective participle (x2), and an auxiliary verb in the present tense (x3). The y-side constraints show that the English verb sequence starts with the auxiliary verb "be" in the present tense (y1). The second element in the English verb sequence is "being", whose form is invariant in this context. The English verb sequence ends with a verb in past participial form (y3). The alignment (X1::Y3) shows that the first element of the Hindi verb sequence corresponds to the last verb of the English verb sequence.

Rules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data. Learning from elicited data proceeds in three stages. The first phase, Seed Generation, produces initial "guesses" at transfer rules. The rules that result from Seed Generation are "flat" in that they specify a sequence of parts of speech and do not contain any non-terminal or phrasal nodes. The second phase, Compositionality Learning, adds structure using previously learned rules. For instance, it learns that sequences such as "Det N P" and "Det Adj N P" can be rewritten more generally as "NP P", as an expansion of PP in Hindi. This generalization process can be done automatically, based on the flat version of the rule and a set of previously learned transfer rules for NPs.

The first two stages of rule learning result in a collection of structural transfer rules that are context-free: they do not contain any unification constraints that limit their applicability. Each of the rules is associated with a collection of elicited examples from which the rule was created. The rules can thus be augmented with a collection of unification constraints, based on specific features that are extracted from the elicited examples. The constraints can then limit the applicability of the rules, so that a rule may succeed only for inputs that satisfy the same unification constraints as the phrases from which the rule was learned. A constraint relaxation technique known as Seeded Version Space Learning attempts to increase the generality of the rules by identifying unification constraints that can be relaxed without introducing translation errors. Detailed descriptions of the rule learning process can be found in [Probst et al. 2003].
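To make the relaxation step concrete, the following sketch shows one simple way such a constraint-relaxation loop could be organized. It is our illustration rather than the actual AVENUE implementation, and the names (rule, examples, translates_correctly) are hypothetical:

# Greedy constraint relaxation in the spirit of Seeded Version Space
# Learning: tentatively drop each unification constraint, and keep the
# relaxation only if the rule still translates all of its elicited
# examples correctly.
def relax(rule, examples, translates_correctly):
    for c in list(rule.constraints):            # rule.constraints: a set
        rule.constraints.discard(c)             # tentatively relax
        if not all(translates_correctly(rule, ex) for ex in examples):
            rule.constraints.add(c)             # relaxation introduced errors: restore
    return rule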

2.3 The Runtime Transfer System

At run time, the translation module translates a source language sentence into a target language sentence. The output of the run-time system is a lattice of translation alternatives. The alternatives arise from syntactic ambiguity, lexical ambiguity, multiple synonymous choices for lexical items in the dictionary, and multiple competing hypotheses from the rule learner.

The runtime translation system incorporates the three main processes involved in transfer-based MT: parsing of the SL input, transfer of the parsed constituents of the SL to their corresponding structured constituents on the TL side, and generation of the TL output. All three of these processes are performed based on the transfer grammar: the comprehensive set of transfer rules that are loaded into the runtime system. In the first stage, parsing is performed based solely on the x-side of the transfer rules. The implemented parsing algorithm is for the most part a standard bottom-up Chart Parser, such as described in [Allen 1995]. A chart is populated with all constituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar. Transfer and generation are performed in an integrated second stage. A dual TL chart is constructed by applying transfer and generation operations on each and every constituent entry in the SL parse chart. The transfer rules associated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side. At the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart. Finally, the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice, which is then passed on for decoding. A more detailed description of the runtime transfer-based translation sub-system can be found in [Peterson 2002].
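The following is a much-simplified sketch, in Python, of how these stages fit together. It is our own illustration, not the actual system code: it handles only flat binary rules with no unification constraints. A rule here is a tuple (lhs, x_side, y_side, align), e.g. ("PP", ("NP", "Postp"), ("Prep", "NP"), (1, 0)), where align gives, for each target position, the index of the source-side child it realizes:

def parse_and_transfer(words, rules, lexicon):
    # chart maps (start, end) spans to lists of (category, TL string) arcs
    chart = {}
    # seed the chart with word-level lexical transfer entries
    for i, w in enumerate(words):
        chart[(i, i + 1)] = [(cat, tl) for (sl, cat, tl) in lexicon if sl == w]
    # bottom-up chart parsing over the x-sides, with integrated
    # transfer/generation of a TL string for every new constituent
    for span in range(2, len(words) + 1):
        for start in range(len(words) - span + 1):
            end = start + span
            arcs = chart.setdefault((start, end), [])
            for lhs, x_side, y_side, align in rules:
                if len(x_side) != 2:        # binary rules only, for brevity
                    continue
                for mid in range(start + 1, end):
                    for c1, t1 in chart.get((start, mid), []):
                        for c2, t2 in chart.get((mid, end), []):
                            if (c1, c2) == x_side:
                                kids = (t1, t2)
                                # reorder children per the alignment
                                tl = " ".join(kids[j] for j in align)
                                arcs.append((lhs, tl))
    return chart  # every chart entry becomes an arc of the TL lattice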

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability p(e|f), where f = f_1 ... f_J is the source sentence and e = e_1 ... e_I is the sequence of target language words. According to Bayes' decision rule, we have to search for

\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)    (1)

The language model p(e) describes how well-formed a target language sentence e is. We use a standard trigram model:

p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})    (2)

The translation probability p(f|e) for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al. 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

p(\tilde{f} \mid \tilde{e}) = \prod_{j} \sum_{i} p(f_j \mid e_i)    (3)

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence s of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then

p(f \mid e) = \prod_{s} p(\tilde{f} \mid \tilde{e}) = \prod_{s} \prod_{j} \sum_{i} p(f_{s_j} \mid e_{s_i})    (4)
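Concretely, equations (3) and (4) can be computed directly from a trained IBM-1 word lexicon. The sketch below is ours, not the system's code; t is assumed to be a dictionary of word-translation probabilities t[(f_word, e_word)]:

def phrase_prob(f_phrase, e_phrase, t):
    # Eq. (3): for each source word, sum the lexicon probabilities
    # over all words of the target phrase, then multiply
    p = 1.0
    for f in f_phrase:
        p *= sum(t.get((f, e), 1e-9) for e in e_phrase)  # small floor for unseen pairs
    return p

def sequence_prob(units, t):
    # Eq. (4): product over the non-overlapping units (f_phrase, e_phrase)
    # that fully cover the source sentence
    p = 1.0
    for f_phrase, e_phrase in units:
        p *= phrase_prob(f_phrase, e_phrase, t)
    return p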

The search algorithm considers all possible sequences s in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e. leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution with mean zero and a variance optimized on a development test set; i.e., longer movements are less likely than shorter ones. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring compared to the best scoring hypothesis up to that point are pruned from further consideration.
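The two decoder heuristics just described can be sketched as follows. This is our own illustration, with the variance and beam factor standing in for values that would be tuned on a development set:

import math

def reordering_score(jump_distance, variance=2.0):
    # zero-mean Gaussian: longer movements are less likely than shorter ones
    return math.exp(-(jump_distance ** 2) / (2.0 * variance))

def beam_prune(hypotheses, factor=1e-3):
    # drop partial hypotheses scoring far below the best one so far;
    # hypotheses is a list of (score, hypothesis) pairs
    best = max(score for score, hyp in hypotheses)
    return [(score, hyp) for score, hyp in hypotheses if score >= factor * best]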

3. ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17,589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will simply fail to match during translation, but will not cause incorrect translations.
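As an illustration of the extraction step, NPs and PPs can be pulled from Treebank-style parses and ordered from simple to complex by parse-tree depth. The sketch below is ours, using nltk's Tree class on bracketed parse strings:

from nltk import Tree

def extract_phrases(bracketed_parses, labels=("NP", "PP")):
    phrases = []
    for parse in bracketed_parses:
        tree = Tree.fromstring(parse)
        for sub in tree.subtrees(lambda st: st.label() in labels):
            # sorting on (depth, text) puts simple phrases before complex ones
            phrases.append((sub.height(), " ".join(sub.leaves())))
    phrases.sort()
    return phrases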

4. HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description           Morpher          Xfer
past participle       (tam = yA)       (aspect = perf) (form = part)
present participle    (tam = wA)       (aspect = imperf) (form = part)
infinitive            (tam = nA)       (form = inf)
future                (tam = future)   (tense = fut)
subjunctive           (tam = subj)     (tense = subj)
root                  (tam = 0)        (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features, such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha ("continue", "stay", "keep"). The left-hand side of the figure shows the raw output of the morphology system. The right-hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
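The Table I mapping is simple enough to express as a lookup. The sketch below is our rendering of it, assuming Morpher's output is available as a feature dictionary:

# Rewrite the Morpher 'tam' feature into the grammar's
# aspect/form/tense features, following Table I.
TAM_MAP = {
    "yA":     {"aspect": "perf",   "form": "part"},   # past participle
    "wA":     {"aspect": "imperf", "form": "part"},   # present participle
    "nA":     {"form": "inf"},                        # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def to_xfer_features(morpher_features):
    # copy everything except 'tam', then splice in the mapped features
    features = {k: v for k, v in morpher_features.items() if k != "tam"}
    features.update(TAM_MAP.get(morpher_features.get("tam"), {}))
    return features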

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequence rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA ("go", to indicate completion), lenA ("take", to indicate an action done for one's own benefit), or denA ("give", to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/mood combinations.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion (jIvana (life) ke (of) eka (one) aXyAya (chapter)):
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion (one chapter of life):
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word,

two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words, leaving only those that were "best bets" for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for the past tense, past participle, and gerund (and continuous), as well as a future tense entry. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add "es" instead of a simple "s" for the plural, etc. For irregular verb inflections, we consulted a list of English irregular verbs.
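The enhancement and spelling rules described above can be pictured with the following minimal sketch (ours, not the system's code): each noun entry yields an extra entry whose English side is pluralized and whose lexical rule carries a plural-number constraint, so that unification later selects the correct form:

def pluralize(noun):
    # simplified spelling rules: -es after sibilants, -ies after
    # consonant + y, plain -s otherwise (irregulars via a word list)
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def enhance_noun(hindi_root, english_noun):
    # root-form entry plus an inflected entry with a (num = pl) constraint
    return [
        {"hi": hindi_root, "en": english_noun,            "constraints": {}},
        {"hi": hindi_root, "en": pluralize(english_noun), "constraints": {"num": "pl"}},
    ]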

With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of postposition words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice in which each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the words of the Hindi sentence against the lexicon in their full, inflected forms, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
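Our reading of the three passes, reduced to a sketch (the helper functions match_lexicon and apply_grammar, and the morpher object, are assumptions for illustration, not actual system interfaces):

def build_lattice(words, lexicon, grammar, morpher):
    arcs = []
    # pass 1: phrasal lexical entries fire on the surface (inflected) forms
    arcs += match_lexicon(words, lexicon, phrasal_only=True)
    # pass 2: analyze each word to a root form, then apply the transfer
    # grammar rules to the roots (using the enhanced lexicon entries)
    roots = [morpher.analyze(w) for w in words]
    arcs += apply_grammar(roots, grammar, lexicon)
    # pass 3: word-to-word matches of the original surface forms, no grammar
    arcs += match_lexicon(words, lexicon, phrasal_only=False)
    return arcs  # the joint set of arcs forms the lattice the decoder rescores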

5. THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group, as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                  BLEU                NIST
EBMT                    0.058               4.22
SMT                     0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer+SMT)         0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

Fig. 9. Results by NIST score: NIST score as a function of the decoder reordering window, for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.

Fig. 10. Results by BLEU score: BLEU score as a function of the decoder reordering window, for the same five systems.

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding: at any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and are allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-Based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the resulting set of scores.
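This resampling procedure amounts to a standard bootstrap. A minimal sketch (ours) is shown below, where metric is assumed to aggregate a sample of per-sentence statistics into one corpus-level score (BLEU and NIST are computed from corpus-level counts, not averaged per sentence):

import random

def bootstrap_ci(sentence_stats, metric, trials=10000, seed=0):
    rng = random.Random(seed)
    n = len(sentence_stats)
    scores = []
    for _ in range(trials):
        # draw n sentences with replacement and rescore the sample
        sample = [sentence_stats[rng.randrange(n)] for _ in range(n)]
        scores.append(metric(sample))
    scores.sort()
    median = scores[trials // 2]
    lower, upper = scores[int(0.025 * trials)], scores[int(0.975 * trials)]
    return median, (lower, upper)  # median and 95% confidence interval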

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different reordering windows in the decoder on the different systems. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6. CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16(2), 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19(2), 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28(1), 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon and improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool, to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set, and an examination of actual translations of development data, indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the Hindi, Hebrew, Mapudungun, and Quechua systems contrasted in this paper, the team has also worked on Dutch, and is beginning work on Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James. (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramática Quechua Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). Building Machine Translation systems for indigenous languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell. (2004). The Translation Correction Tool: English-Spanish user studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg. (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pages 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, Katharina. (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie. (2004). A structurally diverse minimal corpus for eliciting structural mappings between languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjos, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX workshop on Machine Translation for Semitic Languages, New Orleans, September.

Carnegie Mellon University 271206
Refereces from the Hebrew paper
  • Omnivorous MT Building NLP Systems for Multiple Languages
  • Ariadna Font Llitjoacutes1 Christian Monson1 Roberto Aranovich2 Lori Levin1 Ralf Brown1 Katharina Probst3 Erik Peterson1 Jaime Carbonell1 Alon Lavie1
  • 1 Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • 2 Department of Linguistics
  • University of Pittsburgh
  • 3 Wherever Katharin is
Page 22: [Título del trabajo --14] - Carnegie Mellon School of · Web viewAt the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL

The following list summarizes the components of a transfer rule In general the x-side of a transfer rules refers to the source language (SL) whereas the y-side refers to the target language (TL)

-Type information This identi_es the type of the transfer rule and in most cases corre-sponds to a syntactic constituent type Sen-tence rules are of type S noun phrase rules of type NP etc The formalism also allows for SL and TL type information to be di_erent

-Part-of speechconstituent information For both SL and TL we list a linear sequence of components that constitute an instance of the rule type These can be viewed as the `right-hand sides of context-free grammar rules for both source and target language grammars The elements of the list can be lexical cate-gories lexical items andor phrasal cate-gories

-Alignments Explicit annotations in the rule describe how the set of source language com-ponents in the rule align and transfer to the set of target language components Zero alignments and many-to-many alignments are allowed

-X-side constraints The x-side constraints pro-vide information about features and their val-ues in the source language sentence These constraints are used at run-time to determine whether a transfer rule applies to a given in-put sentence PASSIVE SIMPLE PRESENTVPVP [V V Aux] -gt [Aux being V]((X1Y3)((x1 form) = part)((x1 aspect) = perf)((x2 form) = part)((x2 aspect) = imperf)((x2 lexwx) = jAnA)((x3 lexwx) = honA)((x3 tense) = pres)((x0 tense) = (x3 tense))(x0 = x1)((y1 lex) = be)((y1 tense) = pres)((y3 form) = part))Fig 3 A transfer rule for present tense verb sequences in the passive voice

-Y-side constraints The y-side constraints are similar in concept to the xside constraints but they pertain to the target language At run-time y-side constraints serve to guide and constrain the generation of the target lan-guage sentence

-XY-constraints The xy-constraints provide in-formation about which feature values transfer from the source into the target language Speci_c TL words can obtain feature values from the source language sentence

For illustration purposes Figure 3 shows an example of a transfer rule for translating the verbal elements of a present tense sentence in the passive voice This rule would be used in translating the verb sequence bheje jAte h~ai (ldquoare being sent literally sent going present-tense-auxiliary) in a sentence such as Ab-tak patrdAk se bheje jAte h~ai (ldquoLetters still are being sent by mail literally up-to-now letters mail by sent going present-tense-auxiliary) The x-side constraints in Figure 3 show that the Hindi verb sequence consists of a perfective participle (x1) the passive auxil-iary (jAnA ldquogo) inflected as an imperfective participle (x2) and an auxiliary verb in the present tense (x3) The y-side constraints show that the English verb sequence starts with the auxiliary verb be in the present tense (y1) The second element in the English verb sequence is being whose form is invariant in this context The English verb sequence ends with a verb in past participial form (y3) The alignment (X1Y3) shows that the _rst ele-ment of the Hindi verb sequence corresponds to the last verb of the English verb sequenceRules such as the one shown in Figure 3 can be written by hand or learned automatically from elicited data Learning from elicited data proceeds in three stages the _rst phase Seed Generation produces initial `guesses at transfer rules The rules that result from Seed Generation are `flat in that they specify a se-quence of parts of speech and do not contain any non-terminal or phrasal nodes The sec-ond phase Compositionality Learning adds structure using previously learned rules For instance it learns that sequences such as Det N P and Det Adj N P can be re-written more generally as NP P as an expansion of PP in Hindi This generalization process can be done automatically based on the flat version of therule and a set of previously learned transfer rules for NPsThe _rst two stages of rule learning result in a collection of structural transfer rules that are context-freemdashthey do not contain any uni_cation constraints that limit their applica-bility Each of the rules is associated with a collection of elicited examples from which the rule was created The rules can thus be aug-mented with a collection of uni_cation con-straints based on speci_c features that are extracted from the elicited examples The constraints can then limit the applicability of the rules so that a rule may succeed only

for inputs that satisfy the same uni_cation constraints as the phrases from which the rule was learned A constraint relaxation technique known as Seeded Version Space Learning at-tempts to increase the generality of the rules by identifying uni_cation constraints that can be relaxed without introducing translation er-rors Detailed descriptions of the rule learning process can be found in [Probst et al 2003]

23 The Runtime Transfer System

At run time the translation module translates a source language sentence into a target lan-guage sentence The output of the run-time system is a lattice of translation alternatives The alternatives arise from syntactic ambigu-ity lexical ambiguity multiple synonymous choices for lexical items in the dictionary and multiple competing hypotheses from the rule learnerThe runtime translation system incorporates the three main processes involved in transfer-based MT parsing of the SL input transfer of the parsed constituents of the SL to their cor-responding structured constituents on the TL side and generation of the TL output All three of these processes are performed based on the transfer grammarmdashthe comprehensive set of transfer rules that are loaded into the runtime system In the _rst stage parsing is performed based solely on the x side of the transfer rules The implemented parsing algo-rithm is for the most part a standard bottom-up Chart Parser such as described in [Allen 1995] A chart is populated with all con-stituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar Transfer and generation are performed in an inte-grated second stage A dual TL chart is con-structed by applying transfer and generation operations on each and every constituent en-try in the SL parse chart The transfer rules as-sociated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side At the word level lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart Finally the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice which is then passed on for decoding A more detailed description of the runtime transfer-based translation sub-system canbe found in [Peterson 2002]

2.4 Target Language Decoding

In the final stage, a statistical decoder is used in order to select a single target language translation output from a lattice that represents the complete set of translation units that were created for all substrings of the input sentence. The translation units in the lattice are organized according to the positional start and end indices of the input fragment to which they correspond. The lattice typically contains translation units of various sizes for different contiguous fragments of input. These translation units often overlap. The lattice also includes multiple word-to-word (or word-to-phrase) translations, reflecting the ambiguity in the selection of individual word translations.

The task of the statistical decoder is to select a linear sequence of adjoining but non-overlapping translation units that maximizes the probability $p(e \mid f)$, where $f = f_1 \ldots f_J$ is the source sentence and $e = e_1 \ldots e_I$ is the sequence of target language words. According to Bayes' decision rule, we have to search for

$$\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e) \qquad (1)$$

The language model $p(e)$ describes how well-formed a target language sentence $e$ is. We use a standard trigram model:

$$p(e) = \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1}) \qquad (2)$$

The translation probability $p(f \mid e)$ for the entire target sentence is the product of the translation probabilities of the individual translation units. We use the so-called "IBM-1" alignment model [Brown et al 1993] to train a statistical lexicon. Phrase-to-phrase translation probabilities are then calculated using this lexicon:

$$p(\tilde{f} \mid \tilde{e}) = \prod_j \sum_i p(f_j \mid e_i) \qquad (3)$$

where the product runs over all words in the source phrase and the sum over all words in the target phrase. For any possible sequence $s$ of non-overlapping translation units which fully cover the source sentence, the total translation model probability is then

$$p(f \mid e) = \prod_s p(\tilde{f}_s \mid \tilde{e}_s) = \prod_s \prod_j \sum_i p(f_{s,j} \mid e_{s,i}) \qquad (4)$$

The search algorithm considers all possible sequences $s$ in the lattice and calculates the product of the language model probability and the translation model probability for the resulting sequence of target words. It then selects the sequence which has the highest overall probability.

As part of the decoding search, the decoder can also perform a limited amount of re-ordering of translation units in the lattice, when such reordering results in a better fit to the target language model. Reordering is performed by skipping over several words in the source sentence (i.e., leaving a gap), translating the word or phrase further towards the end of the sentence, and filling in the gap afterwards. The word re-orderings are scored using a Gaussian probability distribution (i.e., longer movements are less likely than shorter ones) with mean zero and a variance optimized on a development test set. To keep decoding time limited, we use a beam search: partial translation hypotheses which are low scoring, compared to the best scoring hypothesis up to that point, are pruned from further consideration.
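To make the search concrete, the following Python sketch implements the scoring of Equations (1)-(4) with a monotone dynamic program over the lattice. It is a minimal illustration, not the authors' decoder: the reordering and beam pruning described above are omitted, the trigram model is a stub dictionary, and each translation unit is scored by the language model in isolation rather than across unit boundaries. All data structures (t_table, lm, lattice) are assumptions.

import math

def ibm1_phrase_logprob(f_words, e_words, t_table, floor=1e-10):
    # Eq. (3): for each source word, sum the IBM-1 lexical probabilities
    # p(f_j | e_i) over the target words, then take the product.
    logp = 0.0
    for f in f_words:
        total = sum(t_table.get(f, {}).get(e, 0.0) for e in e_words)
        logp += math.log(max(total, floor))
    return logp

def trigram_logprob(e_words, lm, floor=1e-7):
    # Eq. (2): p(e) = prod_i p(e_i | e_{i-2}, e_{i-1}); lm maps trigram
    # tuples to probabilities (a stand-in for the real 70M-word model).
    logp, hist = 0.0, ("<s>", "<s>")
    for w in e_words:
        logp += math.log(lm.get((hist[0], hist[1], w), floor))
        hist = (hist[1], w)
    return logp

def decode(lattice, n, t_table, lm):
    # Eqs. (1)/(4): choose the sequence of adjoining, non-overlapping
    # units covering source positions 0..n that maximizes p(f|e) * p(e).
    # lattice maps (start, end) spans to lists of (f_words, e_words) units;
    # best[j] holds the best (score, translation) covering words 0..j-1.
    best = {0: (0.0, [])}
    for end in range(1, n + 1):
        for start in range(end):
            if start not in best:
                continue
            for f_words, e_words in lattice.get((start, end), []):
                score = (best[start][0]
                         + ibm1_phrase_logprob(f_words, e_words, t_table)
                         + trigram_logprob(e_words, lm))
                if end not in best or score > best[end][0]:
                    best[end] = (score, best[start][1] + list(e_words))
    return best.get(n, (float("-inf"), ["<no path>"]))[1]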

3 ELICITED DATA COLLECTION

The data for the limited data scenario consisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool. Two very different corpora were used for elicitation: our typological elicitation corpus, and a set of phrases from the Brown Corpus [Fra] that we extracted from the Penn Treebank [Tre].

The typological elicitation corpus covers basic sentence and noun phrase types, moving from simpler to more complex sentences as a linguistic field worker would do. We use it to ensure that at least one example of each basic phenomenon (tense, agreement, case marking, pronouns, possessive noun phrases with various types of possessors, etc.) is encountered. However, the elicitation corpus has the shortcomings that we would encounter with any artificial corpus: the vocabulary is limited, the distribution and frequency of phrases does not reflect what would occur in naturally occurring text, and it does not cover everything that occurs in a natural corpus.

We would like to have the advantages of a natural corpus, but natural corpora also have shortcomings. In order to contain enough examples to fill paradigms of basic phenomena, the corpus must be large, and in order to contain sufficient examples of sparse phenomena, it must be very large. Furthermore, we would like to maintain the convenience of a compositionally ordered corpus, with smaller phrases building up into larger ones.

As a compromise, we used the Penn Treebank to extract phrases from the Brown Corpus. The phrases were extracted from the parsed trees so that they could be sorted according to their daughter nodes (noun phrases containing only nouns, noun phrases containing determiners and nouns, etc.). In this way we obtained a naturally occurring corpus that was also ordered compositionally; a sketch of this extraction step appears below.

The 864 phrases and sentences from the typological elicitation corpus were translated into Hindi by three Hindi speakers working together. After the first 150 sentences, we checked the vocabulary, spelling, and word alignments. The Hindi speakers were then instructed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system, and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor).

The extraction of phrases from the Brown Corpus resulted in tens of thousands of noun phrases and prepositional phrases, some containing embedded sentences. The phrases were first sorted by their complexity, determined by the depth of the parse-tree of the extracted phrase. They were then divided into files of about 200 phrases per file. The files were distributed to fifteen Hindi speakers for translation and alignment. After each Hindi speaker had translated about two hundred phrases (one file), the spelling and grammar were checked. Some non-native speakers were then eliminated from the pool of translators. Because the Brown Corpus contains some obscure vocabulary (e.g., names of chemical compounds), and because some noun phrases and prepositional phrases were not understandable out of context, the Hindi speakers were instructed to skip any phrases that they couldn't translate instantly. Only a portion of the files of extracted data were translated by reliable informants. The final resulting collection consisted of 85 files, adding up to a total of 17589 translated and word-aligned phrases.

We estimated the total amount of human effort required in collecting, translating, and aligning the elicited phrases based on a sample. The estimated time spent on translating and aligning a file (of 200 phrases) was about 8 hours. Translation took about 75% of the time, and alignment about 25%. We estimate the total time spent on all 85 files to be about 700 hours of human labor.
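The following Python sketch illustrates the kind of extraction and sorting described above: collecting NPs and PPs from parsed trees and ordering them by parse-tree depth so that simpler phrases come first. The tuple-based tree representation is an assumption for illustration; the actual Treebank tooling the authors used is not specified here.

def depth(tree):
    # A leaf (a plain string) has depth 0; otherwise 1 + the deepest child.
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree[1:])

def leaves(tree):
    # Collect the words at the fringe of the tree, left to right.
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def extract_phrases(trees, labels=("NP", "PP")):
    # Walk each parse tree, keep NP and PP subtrees, and sort the results
    # by depth so the corpus is ordered compositionally (simplest first).
    found = []
    def walk(node):
        if isinstance(node, str):
            return
        if node[0] in labels:
            found.append((depth(node), node[0], " ".join(leaves(node))))
        for child in node[1:]:
            walk(child)
    for tree in trees:
        walk(tree)
    return sorted(found)

# Example: ("NP", ("DT", "the"), ("NN", "mail")) yields (2, "NP", "the mail")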

Our approach requires elicited data to be translated from English into the minor language (Hindi in this case), even though our trained Xfer system performs translation in the opposite direction. This has both advantages and disadvantages. The main advantage was our ability to rely on the extensive resources available for English, such as treebanks. The main disadvantage was that typing in Hindi was not very natural even for the native speakers of the language, resulting in some level of typing errors. This, however, did not pose a major problem, because the extracted rules are mostly generalized to the part-of-speech level. Furthermore, since the runtime translation direction is from Hindi to English, rules that include incorrect Hindi spelling will simply fail to match during translation, but will not cause incorrect translation.

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher         Xfer
past participle      (tam = yA)      (aspect = perf) (form = part)
present participle   (tam = wA)      (aspect = imperf) (form = part)
infinitive           (tam = nA)      (form = inf)
future               (tam = future)  (tense = fut)
subjunctive          (tam = subj)    (tense = subj)
root                 (tam = 0)       (form = root)
Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha ("continue", "stay", "keep"). The left hand side of the figure shows the raw output of the morphology system. The right hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
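As a concrete rendering of Table I, the Python fragment below transcribes the tense/aspect/mood mapping as a lookup table. It is only an illustrative sketch: the real wrapper also handles the Roman-WX/UTF-8 encoding conversion and the full Morpher feature set, neither of which is shown, and the dictionary-based representation is our assumption.

# Transcription of Table I: Morpher's tam values mapped to Xfer features.
TAM_MAP = {
    "yA":     {"aspect": "perf",   "form": "part"},  # past participle
    "wA":     {"aspect": "imperf", "form": "part"},  # present participle
    "nA":     {"form": "inf"},                       # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def map_morpher_features(morpher_feats):
    # Carry over features such as gender and number unchanged, and replace
    # the tam feature with its Xfer-style equivalent from Table I.
    xfer = {k: v for k, v in morpher_feats.items() if k != "tam"}
    xfer.update(TAM_MAP.get(str(morpher_feats.get("tam", "0")), {}))
    return xfer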

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (as in the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE, we experimented with both hand written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequence rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA ("go", to indicate completion), lenA ("take", to indicate an action done for one's own benefit), or denA ("give", to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion:
(jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion:
(one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
 ((X1 NUM) = (Y2 NUM))
 ((X2 CASE) = (X1 CASE))
 ((X2 GEN) = (X1 GEN))
 ((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
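A minimal sketch of this frequency-based pruning step, under the assumption that each learned rule is simply weighted by the fraction of training examples that produced it; the cutoff value is invented for illustration, since the text does not give the actual threshold:

from collections import Counter

def prune_rules(rule_per_example, cutoff=0.01):
    # rule_per_example: one rule identifier per training example that
    # yielded that rule. Rules whose relative frequency falls below the
    # cutoff are discarded; the survivors keep their probabilities.
    counts = Counter(rule_per_example)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()
            if n / total >= cutoff}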

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words, leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2725 Hindi words, which corresponded to about 10000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we "enhanced" the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for the past tense, past participle, and gerund (and continuous), as well as a future tense entry. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.
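The lookup-and-unify step just described can be sketched as follows. The flat dictionary-based representation of analyses and constraints is an assumption for illustration; real unification handles richer, nested feature structures:

def consistent(features, constraints):
    # A constraint (k = v) is satisfied if the input word's features agree
    # on k; a feature left unspecified by the analyzer does not conflict.
    return all(features.get(k, v) == v for k, v in constraints.items())

def candidate_lexical_rules(word, analyze, lexicon):
    # analyze(word) -> e.g. {"root": "patr", "num": "pl"}; lexicon maps a
    # Hindi root to its lexical transfer rules, each carrying constraints.
    analysis = analyze(word)
    feats = {k: v for k, v in analysis.items() if k != "root"}
    return [rule for rule in lexicon.get(analysis["root"], [])
            if consistent(feats, rule["constraints"])]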

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, or when to add 'es' instead of a simple 's' for the plural. For irregular verb inflections, we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23612 translation pairs.

To create an additional resource for high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.

Finally, our lexicon contains a number of manually written rules:
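A minimal sketch of such spelling-rule-based pluralization, covering only the cases the text mentions plus the common '-y' pattern; the actual rule set (and the irregular-form lists the authors consulted) was certainly larger:

def pluralize(noun):
    # Sibilant endings take -es: 'church' -> 'churches', 'bus' -> 'buses'.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Consonant + y changes to -ies: 'city' -> 'cities'.
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"  # default: 'letter' -> 'letters'

def plural_entry(hindi_root, english_noun):
    # The Hindi side stays unchanged; the English side is inflected, and
    # the entry is marked with a plural-number constraint, as described.
    return {"hindi": hindi_root,
            "english": pluralize(english_noun),
            "constraints": {"num": "pl"}}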

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of post-position words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi words, in their full inflected forms, against the lexicon, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms, along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2.

Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
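Schematically, the three passes can be summarized as below. The callback names are hypothetical stand-ins for the real matching, morphology, and grammar modules:

def populate_lattice(hindi_words, match_phrasal, analyze, apply_grammar,
                     match_words):
    lattice = []
    # Pass 1: full inflected forms against phrasal lexical entries,
    # with no grammar rules applied.
    lattice += match_phrasal(hindi_words)
    # Pass 2: morphological analysis, then grammar rules over the root
    # forms and their inflectional features.
    analyses = [analyze(w) for w in hindi_words]
    lattice += apply_grammar(analyses)
    # Pass 3: word-to-word dictionary matches on the original inflected
    # forms, again without grammar rules.
    lattice += match_words(hindi_words)
    return lattice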

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE, by our own group as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                  BLEU                NIST
EBMT                    0.058               4.22
SMT                     0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer + SMT)       0.136 (+/- 0.018)   5.65 (+/- 0.21)
Table II. System Performance Results for the Various Translation Approaches

[Figure 9: NIST score (y-axis, approx. 4 to 6) vs. reordering window (x-axis, 0 to 4), with curves for MEMT(SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]
Fig. 9. Results by NIST score

[Figure 10: BLEU score (y-axis, approx. 0.07 to 0.15) vs. reordering window (x-axis, 0 to 4), with curves for MEMT(SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT]
Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and by the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003], as well as the BLEU score [Papineni et al 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly drew 258 sentences independently, with replacement, from the set of 258 test sentences (thus a sentence can appear zero, one, or more times in the newly drawn set). We then calculated scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
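This bootstrap procedure is straightforward to sketch in Python; score_fn below stands in for the BLEU or NIST scorer, which is not shown:

import random
import statistics

def bootstrap_ci(test_sentences, score_fn, n_resamples=10000, seed=0):
    # Resample the test set with replacement, re-score each resample, and
    # report the median score with a 95% confidence interval.
    rng = random.Random(seed)
    n = len(test_sentences)  # 258 in the experiment described above
    scores = sorted(
        score_fn([test_sentences[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples))
    low = scores[int(0.025 * n_resamples)]
    high = scores[int(0.975 * n_resamples)]
    return statistics.median(scores), (low, high)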

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point, and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked to our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) both Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html

The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/~hindi

Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm

The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html

Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.

Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.

Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.

Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.

Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).

Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.

Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).

Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.

Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.

Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.

Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).

Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.

Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.

Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).

Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).

Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.

Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, by both increasing the size of the lexicon and improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor-effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

Beyond the Hindi, Hebrew, Mapudungun, and Quechua systems contrasted in this paper, we have also worked on Dutch, and we are beginning work on Inupiaq, Aymara, Brazilian Portuguese, and other languages native to Brazil.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James. (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen. (2003). "Reducing Boundary Friction Using Translation-Fragment Overlap," in Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). "Example-Based Machine Translation at Carnegie Mellon University," in The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio. (2001). Gramatica Quechua: Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie. (2005a). "A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation," European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin. (2005b). "Building Machine Translation Systems for Indigenous Languages," Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna, and Jaime Carbonell. (2004). "The Translation Correction Tool: English-Spanish User Studies," International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert, and Sergei Nirenburg. (1994). "Three Heads are Better than One," Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst. (2004). "Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System," in Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. (2003). "Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario," ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking. (In Press). "Automatic Learning of Grammatical Encoding," to appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan, CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. (2000). "Data Collection and Language Technologies for Mapudungun," International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann. (1992). "The Penn Treebank Project." http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca. (2004). "Data Collection and Analysis of Mapudungun Morphology for Spelling Correction," International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2001). "BLEU: A Method for Automatic Evaluation of Machine Translation," IBM Research Report RC22176 (W0109-022).

Peterson, Erik. (2002). "Adapting a Transfer Engine for Rapid Machine Translation Development," M.S. thesis, Georgetown University.

Probst, Katharina. (2005). "Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario," Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina, and Alon Lavie. (2004). "A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages," Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. (2001). "Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages," Proceedings of the MT2010 Workshop at MT Summit VIII.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew-English English-Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjos, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.

Carnegie Mellon University 271206
Refereces from the Hebrew paper
  • Omnivorous MT Building NLP Systems for Multiple Languages
  • Ariadna Font Llitjoacutes1 Christian Monson1 Roberto Aranovich2 Lori Levin1 Ralf Brown1 Katharina Probst3 Erik Peterson1 Jaime Carbonell1 Alon Lavie1
  • 1 Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • 2 Department of Linguistics
  • University of Pittsburgh
  • 3 Wherever Katharin is
Page 23: [Título del trabajo --14] - Carnegie Mellon School of · Web viewAt the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL

for inputs that satisfy the same uni_cation constraints as the phrases from which the rule was learned A constraint relaxation technique known as Seeded Version Space Learning at-tempts to increase the generality of the rules by identifying uni_cation constraints that can be relaxed without introducing translation er-rors Detailed descriptions of the rule learning process can be found in [Probst et al 2003]

23 The Runtime Transfer System

At run time the translation module translates a source language sentence into a target lan-guage sentence The output of the run-time system is a lattice of translation alternatives The alternatives arise from syntactic ambigu-ity lexical ambiguity multiple synonymous choices for lexical items in the dictionary and multiple competing hypotheses from the rule learnerThe runtime translation system incorporates the three main processes involved in transfer-based MT parsing of the SL input transfer of the parsed constituents of the SL to their cor-responding structured constituents on the TL side and generation of the TL output All three of these processes are performed based on the transfer grammarmdashthe comprehensive set of transfer rules that are loaded into the runtime system In the _rst stage parsing is performed based solely on the x side of the transfer rules The implemented parsing algo-rithm is for the most part a standard bottom-up Chart Parser such as described in [Allen 1995] A chart is populated with all con-stituent structures that were created in the course of parsing the SL input with the source-side portion of the transfer grammar Transfer and generation are performed in an inte-grated second stage A dual TL chart is con-structed by applying transfer and generation operations on each and every constituent en-try in the SL parse chart The transfer rules as-sociated with each entry in the SL chart are used in order to determine the corresponding constituent structure on the TL side At the word level lexical transfer rules are accessed in order to seed the individual lexical choices for the TL word-level entries in the TL chart Finally the set of generated TL output strings that corresponds to the collection of all TL chart entries is collected into a TL lattice which is then passed on for decoding A more detailed description of the runtime transfer-based translation sub-system canbe found in [Peterson 2002]

24 Target Language Decoding

In the _nal stage a statistical decoder is used in order to select a single target language translation output from a lattice that repre-sents the complete set of translation units that were created for all substrings of the in-put sentence The translation units in the lat-tice are organized according the positional start and end indices of the input fragment to which they correspond The lattice typically contains translation units of various sizes for di_erent contiguous fragments of input These translation units often overlap The lattice also includes multiple word-to-word (or word-to-phrase) translations reflecting the ambiguity in selection of individual word translationsThe task of the statistical decoder is to select a linear sequence of adjoining but non-over-lapping translation units that maximizes the probability of p(ejf) where f = f1fJ is the source sentence and e = e1eI is the se-quence of target language words According to Bayes decision rule we have to search for

^e = argmaxep(ejf) = argmaxep(fje)p(e) (1)

The language model p(e) describes how well-formed a target language sentence e is We use a standard trigram model

p(e) =I Y1p(eijei10485762ei10485761) (2)

The translation probability p(fje) for the entire target sentence is the product of the transla-tion probabilities of the individual translation units We use the so called ldquoIBM-1 alignment model [Brown et al 1993] to train a statistical lexicon Phrase-to-phrase translation probabili-ties are then calculated using this lexicon

p( ~ fj~e) =Yj Xip(fj jei) (3)

where the product runs over all words in the source phrase and sum over all words in the target phrase For any possible sequence s of non-overlapping translation units which fully cover the source sentence the total transla-tion model probabilityis then

p(fje) = Ysp( ~ fj~e) =Ys Yj Xip(fsj jesi) (4)

The search algorithm considers all possible se-quences s in the lattice and calculates the product of the language model probability and the translation model probability for the re-sulting sequence of target words It then se-lects the sequence which has the highest overall probabilityAs part of the decoding search the decoder can also perform a limited amount of re-order-ing of translation units in the lattice when such reordering results in a better _t to the target language model Reordering is per-formed by skipping over several words in the source sentence ie leaving a gap translat-ing the word or phrase further towards the end of the sentence and _lling in the gap af-terwards The word re-orderings are scored using a Gaussian probability distribution ie longer movements are less likely than shorter ones with mean zero and a variance opti-mized on a development test set To keep de-coding time limited we use a beam search ie partial translation hypotheses which are low scoring compared to the best scoring hy-pothesis up to that point are pruned from fur-ther consideration

3 ELICITED DATA COLLECTION

The data for the limited data scenario con-sisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool Two very di_erent corpora were used for elicitation our typologi-cal elicitation corpus and a set of phrases from the Brown Corpus ([Fra ]) that we ex-tracted from the Penn Treebank ([Tre ])The typological elicitation corpus covers basic sentence and noun phrase types moving from simpler to more complex sentences as a lin-guistic _eld worker would do We use it to in-sure that at least one example of each basic phenomenon (tense agreement case mark-ing pronouns possessive noun phrases with various types of possessors etc) is encoun-tered However the elicitation corpus has the shortcomings that we would encounter with any arti_cial corpus The vocabulary is limited the distribution and frequency of phrases does not reflect what would occur in naturally oc-curring text and it does not cover everything that occurs in a natural corpusWe would like to have the advantages of a natural corpus but natural corpora also have shortcomings In order to contain enough ex-amples to _ll paradigms of basic phenomena the corpus must be large and in order to con-tain su_cient examples of sparse phenomena it must be very large Furthermore we would like to maintain the convenience of a composi-

tionally ordered corpus with smaller phrases building up into larger onesAs a compromise we used the Penn TreeBank to extract phrases from the Brown Corpus The phrases were extracted from the parsed trees so that they could be sorted according to their daughter nodes (noun phrases con-taining only nouns noun phrases containing determiners and nouns etc) In this way we obtained a naturally occurring corpus that was also ordered compositionallyThe 864 phrases and sentences from the ty-pological elicitation corpus were translated into Hindi by three Hindi speakers working to-gether After the _rst 150 sentences we checked the vocabulary spelling and word alignments The Hindi speakers were then in-structed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor)The extraction of phrases from the Brown Cor-pus resulted in tens of thousands of noun phrases and prepositional phrases some con-taining embedded sentences The phrases were _rst sorted by their complexity deter-mined by the depth of the parse-tree of the extracted phrase They were then divided into _les of about 200 phrases per _le The _les were distributed to _fteen Hindi speakers for translation and alignment After each Hindi speaker had translated about two hundred phrases (one _le) the spelling and grammar were checked Some non-native speakers were then eliminated from the pool of transla-tors Because the Brown Corpus contains some obscure vocabulary (eg names of chemical compounds) and because some noun phrases and prepositional phrases were not understandable out of context the Hindi speakers were instructed to skip any phrases that they couldnt translate instantly Only a portion of the _les of extracted data were translated by reliable informants The _nal re-sulting collection consisted of 85 _les adding up to a total of 17589 translated and word-aligned phrasesWe estimated the total amount of human e_ort required in collecting translating and aligning the elicited phrases based on a sam-ple The estimated time spent on translating and aligning a _le (of 200 phrases) was about 8 hours Translation took about 75 of the time and alignment about 25 We estimate the total time spent on all 85 _les to be about 700 hours of human laborOur approach requires elicited data to be translated from English into the minor lan-guage (Hindi in this case) even though our

trained Xfer system performs translation in the opposite direction This has both advan-tages and disadvantages The main advan-tage was our ability to rely on a extensive re-sources available for English such as tree-banks The main disadvantage was that typing in Hindi was not very natural even for the na-tive speakers of the language resulting in some level of typing errors This however did not pose a major problem because the ex-tracted rules are mostly generalized to the part-of-speech level Furthermore since the runtime translation direction is from Hindi to English rules that include incorrect Hindi spelling will not match during translation but will not cause incorrect translation

4. HINDI-TO-ENGLISH TRANSFER MT SYSTEM

Description          Morpher         Xfer
past participle      (tam = yA)      (aspect = perf) (form = part)
present participle   (tam = wA)      (aspect = imperf) (form = part)
infinitive           (tam = nA)      (form = inf)
future               (tam = future)  (tense = fut)
subjunctive          (tam = subj)    (tense = subj)
root                 (tam = 0)       (form = root)

Table I. Tense, Aspect, and Mood Features for Morpher and Xfer

4.1 Morphology and Grammars

4.1.1 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor]. Given a fully inflected word in Hindi, Morpher outputs the root and other features such as gender, number, and tense. To integrate the IIIT Morpher with our system, we installed it as a server. The IIIT Morpher uses a romanized character encoding for Hindi known as Roman-WX. Since the rest of our system was designed to process Hindi in UTF-8 Unicode encoding, we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encoding as needed.

Figure 4 shows a sample of the morphology output for the word raha (continue, stay, keep). The left-hand side of the figure shows the raw output of the morphology system. The right-hand side shows our transformation of the Morpher output into our grammar formalism and feature system. Table I shows the set of features that Morpher uses for tense, aspect, and mood, and the corresponding features that we mapped them into.
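The feature conversion in Table I amounts to a small lookup applied inside the wrapper. A minimal sketch, with Morpher's output simplified to a Python dictionary (the real interface also performs the Roman-WX/UTF-8 conversion):

# Mapping from Morpher's tam (tense/aspect/mood) values to Xfer
# features, following Table I.
TAM_TO_XFER = {
    "yA":     {"aspect": "perf",   "form": "part"},   # past participle
    "wA":     {"aspect": "imperf", "form": "part"},   # present participle
    "nA":     {"form": "inf"},                        # infinitive
    "future": {"tense": "fut"},
    "subj":   {"tense": "subj"},
    "0":      {"form": "root"},
}

def convert_morpher_features(morpher_feats):
    """Translate a Morpher feature dict into the Xfer feature system."""
    xfer = {}
    for key, value in morpher_feats.items():
        if key == "tam":
            xfer.update(TAM_TO_XFER.get(value, {}))
        else:
            xfer[key] = value   # gender, number, etc. pass through
    return xfer

print(convert_morpher_features({"root": "raha", "tam": "yA"}))
# -> {'root': 'raha', 'aspect': 'perf', 'form': 'part'}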

4.1.2 Manually written grammar

A complete transfer grammar cannot be written in one month (the duration of the Surprise Language Exercise), but partial manually developed grammars can be developed and then used in conjunction with automatically learned rules, lexicon entries, and even other MT engines in a multi-engine system. During the SLE we experimented with both hand-written grammars and automatically learned grammars. While the main focus of our research is on developing automatic learning of transfer rules, the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations. Furthermore, as pointed out above, the manual and automatically learned grammars can in fact be complementary and combined together.

Our grammar of manually written rules has 70 transfer rules. The grammar includes a rather large verb paradigm, with 58 verb sequence rules, ten recursive noun phrase rules, and two prepositional phrase rules.

The verb sequence rules cover the following tense, aspect, and mood categories: simple present, simple past, subjunctive, future, present perfect, past perfect, future perfect, progressive, past progressive, and future progressive. Each tense/aspect/mood can be combined with one of the light verbs jAnA (go, to indicate completion), lenA (take, to indicate an action done for one's own benefit), or denA (give, to indicate an action done for another's benefit). Active and passive voice are also covered for all tense/aspect/moods.

The noun phrase rules include pronouns, nouns, compound nouns, adjectives, and determiners. For the prepositions, the PP rules invert the order of the postposition and the governed NP, and move it after the next NP. The rules shown in Figure 5 can flip arbitrarily long left-branching Hindi NPs into right-branching English NPs, as shown in Figure 6.

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
((X1::Y2) (X2::Y1))

{NP,13}
NP::NP : [NP1] -> [NP1]
((X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
((X1::Y2) (X2::Y1))

Fig. 5. Recursive NP Rules

Hindi NP with left recursion:
(jIvana (life) ke (of) eka (one) aXyAya (chapter))
[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]

English NP with right recursion:
(one chapter of life)
[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]

Fig. 6. Transfer of a Hindi recursive NP into English

NP::NP : [ADJ N] -> [ADJ N]
((X1::Y1) (X2::Y2)
((X1 NUM) = (Y2 NUM))
((X2 CASE) = (X1 CASE))
((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP : [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))

PP::PP : [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules
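The structural effect of rules NP,12 and PP,12 (Figures 5 and 6) can be illustrated with a toy recursive transform over bracketed trees. This sketch only mimics the reordering; the actual engine also applies lexical transfer (e.g., ke -> of) and feature constraints:

def transfer(node):
    # Trees are (label, children) pairs; leaves are plain strings.
    if isinstance(node, str):
        return node
    label, children = node
    children = [transfer(c) for c in children]
    labels = [c[0] if isinstance(c, tuple) else None for c in children]
    if label == "NP" and labels == ["PP", "NP"]:
        # NP,12: [PP NP1] -> [NP1 PP]
        children = [children[1], children[0]]
    elif label == "PP" and labels == ["NP", "Postp"]:
        # PP,12: [NP Postp] -> [Prep NP]; translating the postposition
        # word itself is left to a lexical transfer rule.
        children = [("Prep", children[1][1]), children[0]]
    return (label, children)

# jIvana ke eka aXyAya -> (one chapter) (of life)
hindi_np = ("NP", [("PP", [("NP", ["jIvana"]), ("Postp", ["ke"])]),
                   ("NP", ["eka", "aXyAya"])])
print(transfer(hindi_np))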

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2, and in greater detail in [Probst et al. 2003].

The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of each rule (i.e., how many training examples produce it). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on translation performance, but great impact on the efficiency of the system.

Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
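A sketch of the frequency-based scoring and pruning step, under the simplifying assumption that each rule is identified by a key and counted once per training example that produced it (the threshold value is illustrative):

from collections import Counter

def score_and_prune(rule_instances, threshold=0.10):
    """rule_instances: one rule id per training example that produced it."""
    counts = Counter(rule_instances)
    total = sum(counts.values())
    probs = {rule: n / total for rule, n in counts.items()}
    # Low-probability rules tend to fire in contexts where they
    # should not apply, so they are dropped.
    return {rule: p for rule, p in probs.items() if p >= threshold}

rules = ["NP,12"] * 40 + ["PP,12"] * 55 + ["NP,99"] * 5
print(score_and_prune(rules))  # NP,99 (p = 0.05) is pruned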

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself: the translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both.
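Reading translation pairs off the alignments can be sketched as grouping linked words into connected components, so that many-to-many links yield multi-word entries. The data format here is assumed for illustration:

def lexical_pairs(hindi, english, links):
    """links: set of (i, j) pairs aligning hindi[i] to english[j]."""
    pairs, seen = [], set()
    for i, j in sorted(links):
        if (i, j) in seen:
            continue
        # Grow the component of words transitively connected to (i, j).
        hi, en = {i}, {j}
        changed = True
        while changed:
            changed = False
            for a, b in links:
                if (a in hi or b in en) and not (a in hi and b in en):
                    hi.add(a)
                    en.add(b)
                    changed = True
        seen |= {(a, b) for (a, b) in links if a in hi and b in en}
        pairs.append((" ".join(hindi[k] for k in sorted(hi)),
                      " ".join(english[k] for k in sorted(en))))
    return pairs

print(lexical_pairs(["jIvana", "ke"], ["of", "life"], {(0, 1), (1, 0)}))
# -> [('jIvana', 'life'), ('ke', 'of')]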

Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word. Since some of the English translations are infrequent, or are infrequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words, leaving only those that were best bets for being reliable, all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. The effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we 'enhanced' the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for the past tense, past participle, gerund (and continuous), and future tense. The result is a set of lexical entries associating Hindi root forms to English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word, along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, or when to add 'es' instead of a simple 's' for the plural. For irregular verb inflections we consulted a list of English irregular verbs.
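A sketch of the noun enhancement and of the unification-style filtering at runtime. The spelling rules and the feature representation are simplified, and for clarity the root entry is given an explicit singular constraint:

IRREGULAR_PLURALS = {"child": "children", "man": "men", "foot": "feet"}

def pluralize(noun):
    # Simplified English plural spelling rules.
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"        # add 'es' instead of a simple 's'
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def noun_entries(hindi_root, english_root):
    """The root entry plus an added plural entry with a number constraint."""
    return [
        {"hin": hindi_root, "eng": english_root, "feats": {"num": "sg"}},
        {"hin": hindi_root, "eng": pluralize(english_root),
         "feats": {"num": "pl"}},
    ]

def unifies(input_feats, entry_feats):
    # An entry survives if none of its constraints clash with the
    # features the morphological analyzer returned for the input word.
    return all(input_feats.get(k, v) == v for k, v in entry_feats.items())

entries = noun_entries("aXyAya", "chapter")
morpher_feats = {"num": "pl"}    # Morpher says the Hindi noun is plural
print([e["eng"] for e in entries if unifies(morpher_feats, e["feats"])])
# -> ['chapters']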

With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource of high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgment was applied in selecting bigrams that could be translated reliably out of context.
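The bigram extraction itself is straightforward; a sketch, where the input file name is hypothetical and tokenization is simplified to whitespace splitting:

from collections import Counter

def top_bigrams(tokens, n=500):
    # Count adjacent word pairs and keep the n most frequent.
    return Counter(zip(tokens, tokens[1:])).most_common(n)

tokens = open("hindi_monolingual.txt", encoding="utf-8").read().split()
for (w1, w2), count in top_bigrams(tokens, n=10):
    print(count, w1, w2)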

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of postposition words that generally correspond to English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the Hindi sentence against the lexicon in its full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

The second pass runs each Hindi input word through the morphology module to obtain its root form along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements to the lexicon described in Section 4.2. Finally, the third pass matches the original Hindi words (in their inflected forms) against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules, in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
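The three passes can be sketched as follows. The lexicon and grammar components are stand-ins, and the real engine of course operates on richer structures than strings:

def build_lattice(words, phrasal_lexicon, root_lexicon, morpher, apply_grammar):
    lattice = []   # entries: (start, end, english) over the Hindi input
    # Pass 1: match full inflected forms against phrasal entries;
    # no morphology and no grammar rules.
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            for eng in phrasal_lexicon.get(" ".join(words[i:j]), []):
                lattice.append((i, j, eng))
    # Pass 2: morphological analysis feeds root forms and features
    # into the grammar rules (using the enhanced lexicon of Section 4.2).
    analyses = [morpher(w) for w in words]
    lattice.extend(apply_grammar(analyses))
    # Pass 3: word-to-word matches of the original inflected forms,
    # without grammar rules.
    for i, w in enumerate(words):
        for eng in root_lexicon.get(w, []):
            lattice.append((i, i + 1, eng))
    return lattice

print(build_lattice(["jIvana", "ke"],
                    {"jIvana ke": ["of life"]},
                    {"jIvana": ["life"], "ke": ["of"]},
                    lambda w: (w, {}),
                    lambda analyses: []))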

5. THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group as well as by the other participating groups. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus and the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                  BLEU                 NIST
EBMT                    0.058                4.22
SMT                     0.102 (+/- 0.016)    4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)    5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)    5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)    5.59 (+/- 0.20)
MEMT (Xfer + SMT)       0.136 (+/- 0.018)    5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

Fig. 9. Results by NIST score (NIST score as a function of the decoder reordering window, for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT).

Fig. 10. Results by BLEU score (BLEU score as a function of the decoder reordering window, for the same five systems).

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations, which is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding: at any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and are allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model.
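One decoding step with the reordering window can be pictured as below; this is a deliberately loose sketch, with lm_score standing in for the English language model and arcs for lattice entries:

def pick_arc(arcs, position, output_so_far, lm_score, window=4):
    """arcs: (start, end, english) entries; consider arcs starting up to
    `window` words ahead of the current position and take the one the
    language model likes best when appended to the output so far."""
    candidates = [a for a in arcs if position <= a[0] <= position + window]
    return max(candidates,
               key=lambda a: lm_score(output_so_far + [a[2]]),
               default=None)

# Dummy LM preferring shorter output, just to exercise the function.
print(pick_arc([(0, 1, "life"), (1, 2, "of")], position=0,
               output_so_far=[], lm_score=lambda seq: -len(" ".join(seq))))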

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).

(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.

(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-Based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and by the Xfer system with the manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly drew 258 sentences independently from the set of 258 test sentences (thus a sentence can appear zero, one, or more times in the newly drawn set). We then calculated scores for all systems on the randomly drawn set, rather than the original set. This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the resulting sets of scores.
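This is the standard bootstrap; a sketch, with score standing in for BLEU or NIST computed over a drawn set of test sentences:

import random

def bootstrap_ci(sentences, score, trials=10000, seed=0):
    rng = random.Random(seed)
    n = len(sentences)
    scores = sorted(
        score([sentences[rng.randrange(n)] for _ in range(n)])
        for _ in range(trials))
    # Median and 95% confidence interval from the empirical distribution.
    return (scores[trials // 2],
            (scores[int(0.025 * trials)], scores[int(0.975 * trials)]))

# Toy usage: the "metric" is just the mean of per-sentence scores.
per_sentence = [0.1, 0.2, 0.05, 0.3] * 64   # stands in for 258 sentences
print(bootstrap_ci(per_sentence, lambda s: sum(s) / len(s), trials=1000))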

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different reordering windows in the decoder for the different systems. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over point" between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push it further toward scenarios where more data is available, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the performance gap between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor; a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences for the effectiveness and generality of the rules learned.

The comparison between the various versions of the Xfer system also shows interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is not statistically very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment: the automatically learned rules covered only NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best-performing system was our multi-engine configuration, in which we combined the translation hypotheses of the SMT and Xfer systems into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6. CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE, and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for resource-scarce languages has proven effective. By first focusing effort on producing basic NLP resources, the resulting resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

In the conclusion for the paper contrasting Hindi, Hebrew, Mapudungun, and Quechua, we should list the other languages we have worked on and will be working on.

Worked on: Dutch.
Beginning work on: Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People who participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.
Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.
Cusihuaman, Antonio (2001). Gramática Quechua: Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.
Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association for Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.
Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.
Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.
Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.
Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.
Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).
Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.
Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).
Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html
Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).
Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.
Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.
Probst, Katharina and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).
Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 Workshop at MT Summit VIII.



The search algorithm considers all possible se-quences s in the lattice and calculates the product of the language model probability and the translation model probability for the re-sulting sequence of target words It then se-lects the sequence which has the highest overall probabilityAs part of the decoding search the decoder can also perform a limited amount of re-order-ing of translation units in the lattice when such reordering results in a better _t to the target language model Reordering is per-formed by skipping over several words in the source sentence ie leaving a gap translat-ing the word or phrase further towards the end of the sentence and _lling in the gap af-terwards The word re-orderings are scored using a Gaussian probability distribution ie longer movements are less likely than shorter ones with mean zero and a variance opti-mized on a development test set To keep de-coding time limited we use a beam search ie partial translation hypotheses which are low scoring compared to the best scoring hy-pothesis up to that point are pruned from fur-ther consideration

3 ELICITED DATA COLLECTION

The data for the limited data scenario con-sisted entirely of phrases and sentences that were translated and aligned by Hindi speakers using our elicitation tool Two very di_erent corpora were used for elicitation our typologi-cal elicitation corpus and a set of phrases from the Brown Corpus ([Fra ]) that we ex-tracted from the Penn Treebank ([Tre ])The typological elicitation corpus covers basic sentence and noun phrase types moving from simpler to more complex sentences as a lin-guistic _eld worker would do We use it to in-sure that at least one example of each basic phenomenon (tense agreement case mark-ing pronouns possessive noun phrases with various types of possessors etc) is encoun-tered However the elicitation corpus has the shortcomings that we would encounter with any arti_cial corpus The vocabulary is limited the distribution and frequency of phrases does not reflect what would occur in naturally oc-curring text and it does not cover everything that occurs in a natural corpusWe would like to have the advantages of a natural corpus but natural corpora also have shortcomings In order to contain enough ex-amples to _ll paradigms of basic phenomena the corpus must be large and in order to con-tain su_cient examples of sparse phenomena it must be very large Furthermore we would like to maintain the convenience of a composi-

tionally ordered corpus with smaller phrases building up into larger onesAs a compromise we used the Penn TreeBank to extract phrases from the Brown Corpus The phrases were extracted from the parsed trees so that they could be sorted according to their daughter nodes (noun phrases con-taining only nouns noun phrases containing determiners and nouns etc) In this way we obtained a naturally occurring corpus that was also ordered compositionallyThe 864 phrases and sentences from the ty-pological elicitation corpus were translated into Hindi by three Hindi speakers working to-gether After the _rst 150 sentences we checked the vocabulary spelling and word alignments The Hindi speakers were then in-structed on how to align case markers and auxiliary verbs in the way that is best for our rule learning system and completed the translation and alignment of the corpus in less than 20 hours (resulting in a total of about 60 hours of human labor)The extraction of phrases from the Brown Cor-pus resulted in tens of thousands of noun phrases and prepositional phrases some con-taining embedded sentences The phrases were _rst sorted by their complexity deter-mined by the depth of the parse-tree of the extracted phrase They were then divided into _les of about 200 phrases per _le The _les were distributed to _fteen Hindi speakers for translation and alignment After each Hindi speaker had translated about two hundred phrases (one _le) the spelling and grammar were checked Some non-native speakers were then eliminated from the pool of transla-tors Because the Brown Corpus contains some obscure vocabulary (eg names of chemical compounds) and because some noun phrases and prepositional phrases were not understandable out of context the Hindi speakers were instructed to skip any phrases that they couldnt translate instantly Only a portion of the _les of extracted data were translated by reliable informants The _nal re-sulting collection consisted of 85 _les adding up to a total of 17589 translated and word-aligned phrasesWe estimated the total amount of human e_ort required in collecting translating and aligning the elicited phrases based on a sam-ple The estimated time spent on translating and aligning a _le (of 200 phrases) was about 8 hours Translation took about 75 of the time and alignment about 25 We estimate the total time spent on all 85 _les to be about 700 hours of human laborOur approach requires elicited data to be translated from English into the minor lan-guage (Hindi in this case) even though our

trained Xfer system performs translation in the opposite direction This has both advan-tages and disadvantages The main advan-tage was our ability to rely on a extensive re-sources available for English such as tree-banks The main disadvantage was that typing in Hindi was not very natural even for the na-tive speakers of the language resulting in some level of typing errors This however did not pose a major problem because the ex-tracted rules are mostly generalized to the part-of-speech level Furthermore since the runtime translation direction is from Hindi to English rules that include incorrect Hindi spelling will not match during translation but will not cause incorrect translation

4 HINDI-TO-ENGLISH TRANSFER MT SYSTEMDescription Morpher Xferpast participle (tam = yA) (aspect = perf) (form = part)present participle (tam = wA) (aspect = imperf) (form = part)in_nitive (tam = nA) (form = inf)future (tam = future) (tense = fut)subjunctive (tam = subj) (tense = subj)root (tam = 0) (form = root)Table I Tense Aspect and Mood Features for Morpher and Xfer41 Morphology and Grammars

411 Morphology

The morphology module used by the runtime Xfer system was the IIIT Morpher [Mor ] Given a fully inflected word in Hindi Morpher out-puts the root and other features such as gen-der number and tense To integrate the IIIT Morpher with our system we installed it as a server The IIIT Morpher uses a romanized character encoding for Hindi known as Ro-man-WX Since the rest of our system was de-signed to process Hindi in UTF-8 Unicode en-coding we implemented an input-output wrapper interface with the IIIT Morpher that converted the input and output Hindi encod-ing as neededFigure 4 shows a sample of the morphology output for the word raha (continue stay keep) The left hand side of the _gure shows the raw output of the morphology system The right hand side shows our transformation of the Morpher output into our grammar formal-ism and feature system Table I shows the set of features that Morpher uses for tense as-pect and mood and the corresponding fea-tures that we mapped them into

412 Manually written grammar

A complete transfer grammar cannot be writ-ten in one month (as in the Surprise Language Exercise) but partial manually developed

grammars can be developed and then used in conjunction with automatically learned rules lexicon entries and even other MT engines in a multi-engine system During the SLE we ex-perimented with both hand written grammars and automatically learned grammars While the main focus of our research is on develop-ing automatic learning of transfer rules the manually developed transfer rule grammar can serve as an excellent point of comparison in translation evaluations Furthermore as pointed out above the manual and automati-cally learned grammars can in fact be compli-mentary and combined togetherOur grammar of manually written rules has 70 transfer rules The grammar includes a rather large verb paradigm with 58 verb sequence rules ten recursive noun phrase rules and two prepositional phrase rulesThe verb sequences that are covered by the rules cover the following tense aspect and mood categories simple present simple past subjunctive future present perfect past per-fect future perfect progressive past progres-sive and future progressive Each tenseas-pectmood can be combined with one of the light verbs jAnA (go to indicate completion) lenA (take to indicate an action done for ones own bene_t) or denA (give to indicate an action done for anothers bene_t) Active and passive voice are also covered for all tenseaspectmoodsThe noun phrase rules include pronouns nouns compound nouns adjectives and de-terminers For the prepositions the PP rules invert the order of the postposition and the governed NP and move it after the next NP The rules shown in Figure 5 can flip arbitrarily long left branching Hindi NPs into right branching English NPs as shown in Figure 6

NP12NPNP [PP NP1] -gt [NP1 PP]((X1Y2)(X2Y1))NP13NPNP [NP1] -gt [NP1]((X1Y1))PP12PPPP [NP Postp] -gt [Prep NP]((X1Y2)(X2Y1))Fig 5 Recursive NP RulesHindi NP with left recursion(jIvana (life) ke (of) eka (one) aXyAya (chapter))[np [pp [np [nbar [n jIvana]]] [p ke]] [nbar [adj eka] [n aXyAya]]]English NP with right recursion(one chapter of life)[np [nbar [adj one] [n chapter]] [pp [p of] [np [nbar [n life]]]]]Fig 6 Transfer of a Hindi recursive NP into EnglishNPNP [ADJ N] -gt [ADJ N]((X1Y1) (X2Y2)((X1 NUM) = (Y2 NUM))((X2 CASE) = (X1 CASE))

((X2 GEN) = (X1 GEN))((X2 NUM) = (X1 NUM)))PPPP [NP POSTP] -gt [PREP NP]((X2Y1)(X1Y2))PPPP [N CONJ NUM N N N POSTP] -gt [PREP N CONJ NUM N N N]((X7Y1) (X1Y2) (X2Y3) (X3Y4) (X4Y5) (X5Y6) (X6Y7))Fig 7 Some Automatically Learned Rules

413 Automatically learned grammar

In addition to the manually written grammar we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules The rules were learned as de-scribed briefly in Section 2 and in greater de-tail in [Probst et al 2003]The learned grammar consists of a total of 327 rules which are exclusively NP and PP rules as inferred from the Penn Treebank elicited data In a second round of experi-ments we assigned probabilities to the rules based on the frequency of the rule (ie how many training examples produce a certain rule) We then pruned rules with low probabil-ity resulting in a grammar of a mere 16 rules The rationale behind pruning rules is that low-probability rules will produce spurious transla-tions most of the time as they _re in contexts where they should not actually apply In our experience rule pruning has very little effect on the translation performance but great im-pact on the e_ciency of the systemFigure 7 shows some rules that were automat-ically inferred from the training data Note that the second rule contains a non-terminal symbol (NP) that waslearned by the compositionality module

42 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation The lexicon contains entries from a variety of sources The most obvious source for lexical translation pairs is the elicited corpus itself The translations pairs can simply be read o_ from the align-ments that were manually provided by Hindi speakers Because the alignments did not need to be 1-to-1 the resulting lexical transla-tion pairs can have strings of more than one word on either the Hindi or English side or bothAnother source for lexical entries is the is Eng-lish-Hindi dictionary provided by the Linguistic Data Consortium (LDC) The LDC dictionary contains many (as many as 25) English trans-lations for each Hindi word Since some of the English translations are not frequent or are not frequent translations of the Hindi word

two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words and leaving only those that were best bets for being reliable all-purpose translations of the Hindi word The full LDC lexicon was _rst sorted by Hindi word frequency (estimated from Hindi monolingual text) and the cleanup was performed on the most frequent 12 of the Hindi words in the lexicon The clean portion of the LDC lexicon was then used for the limited-data experiment This consisted of 2725 Hindi words which corresponded to about 10000 translation pairs This effort took about 3 days of manual laborThe entries in the LDC dictionary are in root forms (both in English and Hindi) In order to be able to produce inflected English forms such as plural nouns we `enhanced the dic-tionary with such forms The enhancement works as follows for each noun in the dictio-nary (where the part-of-speech tags are indi-cated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]) we create an additional entry with the Hindi word unchanged and the English word in the plural form In addition to the changed English word we also add a constraint to the lexical rule indicating that the number is plu-ral A similar strategy was applied to verbs we added entries (with constraints) for past tense past participle gerund (and continu-ous) as well as future tense entry The result is a set of lexical entries associating Hindi root forms to English inflected formsHow and why do these additional entries work The transfer engine _rst runs each Hindi input word through the morphological ana-lyzer The Morpher returns the root form of the word along with a set of morphological features The root form is then matched against the lexicon and the set of lexical transfer rules that match the Hindi root are extracted The morphological features of the input word are then uni_ed with the feature constraints that appear in each of the candi-date lexical transfer rules pruning out the rules that have inconsistent features For ex-ample assume we are processing a plural noun in Hindi The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon but only the entry that contains a plural form in English will pass uni_cation and succeedSince we did not have immediate access to an English morphology module the word inflec-tions were primarily based on spelling rules Those rules indicate for instance when to reduplicate a consonant when to add `es in-stead of a simple `s for plural etc For irregu-lar verb inflections we consulted a list of Eng-

lish irregular verbs With these enhance-ments the part of our dictionary derived from the LDCHindi-English dictionary contains a total of 23612 translation pairsTo create an additional resource for high-qual-ity translation pairs we used monolingual Hindi text to extract the 500 most frequent bi-grams These bigrams were then translated into English by an expert in about 2 days Some judement was applied in selecting bi-grams that could be translated reliably out of contextFinally our lexicon contains a number of man-ually written rules

-72 manually written phrase transfer rules a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently oc-curring in development-test data early on dur-ing the SLE

-105 manually written postposition rules Hindi uses a collection of post-position words that generally correspond with English prepo-sitions A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions) as observed in development-test data

-48 manually written time expression rules a bilingual speaker manually entered the Eng-lish translation equivalents of 48 common time expressions as observed in develop-ment-test data

43 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations The result of a transfer instance on a given input sentence is a lattice where each entry indi-cates which Hindi words the partial translation spans and what the potential English transla-tion is The decoder then rescores these en-tries and _nds the best path through the lat-tice (for more details see Section 24)Production of the lattice takes place in three passes Although all possible (partial) transla-tions are output we have to treat with care the interaction between the morphology the grammar rules and the lexicon The _rst pass matches the Hindi sentence against the lexi-con in their full form before applying mor-phology This is designed especially to allow phrasal lexical entries to _re where any lexi-cal entry with more than one Hindi word is considered a phrasal entry In this phase no grammar rules are applied

Another pass runs each Hindi input word through the morphology module to obtain the root forms along with any inflectional features of the word These rootforms are then fed into the grammar rules Note that this phase takes advantage of the enhancements in the lexicon as described in Section 42Finally the original Hindi words (in their in-flected forms) are matched against the dictio-nary producing word-to-word translations without applying grammarrulesThese three passes exhaust the possibilities of matching lexical entries and applying gram-mar rules in order to maximize the size of the resulting lattice A bigger lattice is generally preferable as its entries are rescored and low-probability entries are not chosen by the lan-guage model for the _nal translation

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                  BLEU                 NIST
EBMT                    0.058                4.22
SMT                     0.102 (+/- 0.016)    4.70 (+/- 0.20)
Xfer no grammar         0.109 (+/- 0.015)    5.29 (+/- 0.19)
Xfer learned grammar    0.112 (+/- 0.016)    5.32 (+/- 0.19)
Xfer manual grammar     0.135 (+/- 0.018)    5.59 (+/- 0.20)
MEMT (Xfer + SMT)       0.136 (+/- 0.018)    5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

[Figure 9: line plot of NIST score (4.0 to 6.0) against decoder reordering window (0 to 4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.]

Fig. 9. Results by NIST score

[Figure 10: line plot of BLEU score (0.07 to 0.15) against decoder reordering window (0 to 4) for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.]

Fig. 10. Results by BLEU score

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder. Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding: at any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and are allowed to be moved to the current position in the output if this improves the overall score of the resulting output string according to the English language model (a schematic sketch of this reordering appears after the list of systems below).

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:

(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-Based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and by the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.
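As referenced above, the following is a hedged sketch of the decoder's limited arc reordering. The greedy search and toy language model are simplifications of the actual lattice decoder, intended only to show how an arc starting up to four source words ahead can be consumed early when the language model prefers the resulting output string.

def decode_with_reordering(arcs, lm_score, n_words, window=4):
    # arcs: (start, end, english) lattice entries; greedy approximation.
    output, covered, pos = [], set(), 0
    while pos < n_words:
        if pos in covered:                # already translated by some arc
            pos += 1
            continue
        # consider arcs starting here, or up to `window` words ahead
        cands = [a for a in arcs
                 if pos <= a[0] <= pos + window
                 and not any(k in covered for k in range(a[0], a[1]))]
        if not cands:
            pos += 1                      # skip an untranslatable word
            continue
        # take the arc whose English side best extends the output LM-wise
        best = max(cands, key=lambda a: lm_score(output + [a[2]]))
        output.append(best[2])
        covered.update(range(best[0], best[1]))
    return " ".join(output)

# Toy run: a scorer that prefers outputs building toward "very good".
def toy_lm(seq):
    return {"very": 1, "very good": 2}.get(" ".join(seq), 0)

arcs = [(0, 1, "good"), (1, 2, "very")]   # source order would yield "good very"
print(decode_with_reordering(arcs, toy_lm, 2))   # -> "very good"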

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the resulting set of scores.
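This resampling procedure lends itself directly to code. The sketch below assumes a corpus-level score function (standing in for BLEU or NIST) and otherwise follows the description: 10,000 draws of 258 sentences with replacement, reporting the median and the 2.5th/97.5th percentile bounds.

import random

def bootstrap_ci(test_set, score_fn, n_draws=10000, seed=0):
    rng = random.Random(seed)
    n = len(test_set)                      # 258 sentences in the experiment
    scores = []
    for _ in range(n_draws):
        # draw n sentences independently, with replacement
        sample = [test_set[rng.randrange(n)] for _ in range(n)]
        scores.append(score_fn(sample))    # corpus-level BLEU or NIST
    scores.sort()
    median = scores[n_draws // 2]
    lo = scores[int(0.025 * n_draws)]
    hi = scores[int(0.975 * n_draws)]
    return median, (lo, hi)

# Demo with a dummy corpus-level score (mean hypothesis length):
data = [("hyp %d" % i, ["ref"]) for i in range(258)]
print(bootstrap_ci(data, lambda s: sum(len(h.split()) for h, _ in s) / len(s)))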

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of the different systems with different reordering windows in the decoder. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU) and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over" point between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language resource. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target language model that together result in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.
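As a purely hypothetical illustration of how these three planned knowledge sources might enter the decoder, the sketch below scores a candidate translation segment as a weighted sum of log-scores. The feature names, example values, and weights are our assumptions for the sketch, not the system's specified design.

def segment_score(feats, weights):
    # Weighted log-linear combination of per-segment knowledge sources.
    return sum(weights[k] * feats[k] for k in weights)

# Hypothetical log-scores for one candidate segment:
feats = {
    "target_lm": -12.3,  # English LM log-prob (already used by the decoder)
    "source_lm": -9.8,   # (1) monolingual Hebrew LM log-prob of the reading
    "rule": -1.2,        # (2) score of the transfer rules that built the segment
    "lexicon": -2.5,     # (3) log-prob from a probabilistic translation lexicon
}
weights = {"target_lm": 1.0, "source_lm": 0.5, "rule": 0.8, "lexicon": 1.0}
print(segment_score(feats, weights))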

In the conclusion for the paper contrasting Hindi, Hebrew, Mapudungun, and Quechua, we should list the other languages we have worked on and will be working on: worked on Dutch; beginning work on Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People that participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James (1995). Natural Language Understanding. Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5, no. 1, January-March 2000.

Cusihuaman, Antonio (2001). Gramática Quechua: Cuzco Collao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pages 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 Workshop at MT Summit VIII.

two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words and leaving only those that were best bets for being reliable all-purpose translations of the Hindi word The full LDC lexicon was _rst sorted by Hindi word frequency (estimated from Hindi monolingual text) and the cleanup was performed on the most frequent 12 of the Hindi words in the lexicon The clean portion of the LDC lexicon was then used for the limited-data experiment This consisted of 2725 Hindi words which corresponded to about 10000 translation pairs This effort took about 3 days of manual laborThe entries in the LDC dictionary are in root forms (both in English and Hindi) In order to be able to produce inflected English forms such as plural nouns we `enhanced the dic-tionary with such forms The enhancement works as follows for each noun in the dictio-nary (where the part-of-speech tags are indi-cated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]) we create an additional entry with the Hindi word unchanged and the English word in the plural form In addition to the changed English word we also add a constraint to the lexical rule indicating that the number is plu-ral A similar strategy was applied to verbs we added entries (with constraints) for past tense past participle gerund (and continu-ous) as well as future tense entry The result is a set of lexical entries associating Hindi root forms to English inflected formsHow and why do these additional entries work The transfer engine _rst runs each Hindi input word through the morphological ana-lyzer The Morpher returns the root form of the word along with a set of morphological features The root form is then matched against the lexicon and the set of lexical transfer rules that match the Hindi root are extracted The morphological features of the input word are then uni_ed with the feature constraints that appear in each of the candi-date lexical transfer rules pruning out the rules that have inconsistent features For ex-ample assume we are processing a plural noun in Hindi The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon but only the entry that contains a plural form in English will pass uni_cation and succeedSince we did not have immediate access to an English morphology module the word inflec-tions were primarily based on spelling rules Those rules indicate for instance when to reduplicate a consonant when to add `es in-stead of a simple `s for plural etc For irregu-lar verb inflections we consulted a list of Eng-

lish irregular verbs With these enhance-ments the part of our dictionary derived from the LDCHindi-English dictionary contains a total of 23612 translation pairsTo create an additional resource for high-qual-ity translation pairs we used monolingual Hindi text to extract the 500 most frequent bi-grams These bigrams were then translated into English by an expert in about 2 days Some judement was applied in selecting bi-grams that could be translated reliably out of contextFinally our lexicon contains a number of man-ually written rules

-72 manually written phrase transfer rules a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently oc-curring in development-test data early on dur-ing the SLE

-105 manually written postposition rules Hindi uses a collection of post-position words that generally correspond with English prepo-sitions A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions) as observed in development-test data

-48 manually written time expression rules a bilingual speaker manually entered the Eng-lish translation equivalents of 48 common time expressions as observed in develop-ment-test data

43 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations The result of a transfer instance on a given input sentence is a lattice where each entry indi-cates which Hindi words the partial translation spans and what the potential English transla-tion is The decoder then rescores these en-tries and _nds the best path through the lat-tice (for more details see Section 24)Production of the lattice takes place in three passes Although all possible (partial) transla-tions are output we have to treat with care the interaction between the morphology the grammar rules and the lexicon The _rst pass matches the Hindi sentence against the lexi-con in their full form before applying mor-phology This is designed especially to allow phrasal lexical entries to _re where any lexi-cal entry with more than one Hindi word is considered a phrasal entry In this phase no grammar rules are applied

Another pass runs each Hindi input word through the morphology module to obtain the root forms along with any inflectional features of the word These rootforms are then fed into the grammar rules Note that this phase takes advantage of the enhancements in the lexicon as described in Section 42Finally the original Hindi words (in their in-flected forms) are matched against the dictio-nary producing word-to-word translations without applying grammarrulesThese three passes exhaust the possibilities of matching lexical entries and applying gram-mar rules in order to maximize the size of the resulting lattice A bigger lattice is generally preferable as its entries are rescored and low-probability entries are not chosen by the lan-guage model for the _nal translation

5 THE LIMITED DATA SCENARIO EXPERIMENT

51 Training and Testing Data

In order to compare the e_ectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly lim-ited data resources we arti_cially crafted a collection of training data for Hindi-to-English translation that intentionally contained ex-tremely limited amounts of parallel text The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group as well as the other participating groups in the SLE The limited data consisted of the following re-sources

-Elicited Data Corpus 17589 word-aligned phrases and sentences from the elicited data collection described in section 3 This includes both our translated and aligned controlled elicitation corpus and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank

-Small Hindi-to-English Lexicon 23612 clean translation pairs from the LDC dictionary (as described in Section 42)

-Small Amount of Manually Acquired Re-sources (As described in Section 42) | 500 most common Hindi bigrams 72 manually written phrase transfer rules 105 manually written postposition rules and 48 manually written time expression rules

The limited data setup includes no additional parallel Hindi-English text The total amount of

bilingual training data was estimated to amount to about 50000 wordsA small previously unseen Hindi text was se-lected as a test-set for this experiment The test-set chosen was a section of the data col-lected at Johns Hopkins University during the later stages of the SLE using a web-based in-terface [JHU] The section chosen consists of 258 sentences for which four English refer-ence translations are available An example test sentence can be seen in Figure 8Fig 8 Example of a Test Sentence and Reference Trans-lations

System BLEU NISTEBMT 0058 422SMT 0102 (+- 0016) 470 (+- 020)Xfer no grammar 0109 (+- 0015) 529 (+- 019)Xfer learned grammar 0112 (+- 0016) 532 (+- 019)Xfer manual grammar 0135 (+- 0018) 559 (+- 020)MEMT (Xfer +SMT) 0136 (+- 0018) 565 (+- 021)Table II System Performance Results for the Various Translation Approaches

NIST scores44244464855254565860 1 2 3 4Reordering WindowNIST ScoreMEMT(SMT+XFER)XFER-ManGramXFER-LearnGramXFER-NoGramSMT

Fig 9 Results by NIST score

BLEU scores007008009010110120130140150 1 2 3 4Reordering WindowBLEU ScoreMEMT(SMT+XFER)XFER-ManGramXFER-LearnGramXFER-NoGramSMT

Fig 10 Results by BLEU score

52 Experimental Testing Con_guration

The experiment was designed to evaluate and compare several di_erent con_gurations of the Xfer system and the two corpus-based sys-tems (SMT and EBMT) all trained on the same limited-data resources and tested on the same test setThe transfer engine was run in a setting that _nds all possible translations that are consis-tent with the transfer rules The transfer en-gine produces a complete lattice of all possi-ble partial translations The lattice is then in-put into a statistical decoder Decoding is per-

formed as described in Section 24 We used an English language model that was trained on 70 million English words As an additional functionality the decoder can perform a lim-ited reordering of arcs during decoding At any point in the decoding arcs that have a start position index of up to four words ahead are considered and allowed to be moved to the current position in the output if this im-proves the overall score of the resulting out-put string according to the English language modelThe following systems were evaluated in the experiment

(1) The following versions of the Xfer system

(a) Xfer with No Grammar Xfer with no syn-tactic transfer rules (ie only phrase-to-phrase matches and word-to-word lexical transfer rules with and without morphology)(b) Xfer with Learned Grammar Xfer with au-tomatically learned syntactic transfer rules as described in section 413(c) Xfer with Manual Grammar Xfer with the manually developed syntactic transfer rules as described in section 412

(2) SMT The CMU Statistical MT (SMT) system [Vogel et al 2003] trained on the limited-data parallel text resources

(3) EBMT The CMU Example-based MT (EBMT) system [Brown 1997] trained on the limited-data parallel text resources

(4) MEMT A ldquomulti-engine version that com-bines the lattices produced by the SMT sys-tem and the Xfer system with manual gram-mar The decoder then selects an output from the joint lattice

53 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the Bleu score [Papineni et al 2002] In order to validate the statistical signi_cance of the di_erences in NIST and Bleu scores we applied a commonly used sampling technique over the test set we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero once or more in the newly drawn set) We then calculate scores for all systems on the randomly drawn set (rather than the original set) This process was re-peated 10000 times Median scores and 95 con_dence intervals were calculated based on the set of scores

Table II compares the results of the di_erent systems (with the decoders best reordering window) along with the 95 con_dence inter-vals Figures 9 and 10 show the e_ect of di_er-ent systems with di_erent reordering windows in the decoder For clarity con_dence inter-vals are graphically shown only for the NIST scores (not for Bleu) and only for SMT Xfer with no grammar and MEMT

54 Discussion of Results

The results of the experiment clearly show that under the speci_c miserly data training scenario that we constructed the Xfer sys-tem with all its variants signi_cantly outper-formed the SMT system While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach we see these results as a clear validation of the utility and e_ectiveness of our transfer ap-proach in other scenarios where only very lim-ited amounts of parallel text and other online resources are available In earlier experiments during the SLE we observed that SMT outper-formed the Xfer system when much larger amounts of parallel text data were used for system training This indicates that there ex-ists a data ldquocross-over point between the performance of the two systems given more and more data SMT will outperform the cur-rent Xfer system Part of future work will be to _rst determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given thus making the Xfer system applicable to a wider variety of conditionsThe use of morphology within the Xfer system was also a signi_cant factor in the gap in per-formance between the Xfer system and the SMT system in theexperiment Token word coverage of the test-set without morphology is about 70 whereas with morphology token coverage in-creases to around 79 We acknowledge that the availability of a high-coverage morphologi-cal analyzer for Hindi worked to our favor and a morphological analyzer of such quality may not be available for many other languages Our Xfer approach however can function even with partial morphological information with some consequences on the e_ectiveness and generality of the rules learnedThe results of the comparison between the various versions of the Xfer system also show interesting trends although the statistical signi_cance of some of the di_erences is not very high Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical signi_cance) Xfer with no grammar and Xfer with automatically learned

grammar Xfer with automatically learned grammar is slightly better than Xfer with no grammar but the di_erence is statistically not very signi_cant We take these results to be highly encouraging since both the manually written and automatically learned grammars were very limited in this experiment The au-tomatically learned rules only covered NPs and PPs whereas the manually developed grammar mostly covers verb constructions While our main objective is to infer rules that perform comparably tohand-written rules it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar sys-tem indicating that there is much room for improvement If the learning algorithms are improved the performance of the overall sys-tem can also be improved signi_cantlyThe signi_cant e_ects of decoder reordering are also quite interesting On one hand we believe this indicates that various more so-phisticated rules could be learned and that such rules could better order the English out-put thus reducing the need for re-ordering by the decoder On the other hand the results in-dicate that some of the ldquoburden of reordering can remain within the decoder thus possibly compensating for weaknesses in rule learningFinally we were pleased to see that the con-sistently best performing system was our multi-engine con_guration where we com-bined the translation hypotheses of the SMT and Xfer systems together into a common lat-tice and applied the decoder to select a _nal translation The MEMT con_guration outper-formed the best purely Xfer system with rea-sonable statistical con_dence Obtaining a multiengine combination scheme that consis-tently outperforms all the individual MT en-gines has been notoriously di_cult in past re-search While the results we obtained here are for a unique data scenario we hope that the framework applied here for multi-engine inte-gration will prove to be e_ective for a variety of other scenarios as well The inherent di_er-ences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to de-velop and test an open-domain largescale Hindi-to-English version of our Xfer system This experience was extremely helpful for en-hancing the basic capabilities of our system The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very e_ective in boosting the overall performance of our Xfer system

The experiments we conducted under the ex-tremely limited Hindi data resources scenario were very insightful The results of our experi-ments indicate that our Xfer system in its cur-rent state outperforms SMT and EBMT when the amount of available parallel training text is extremely small The Xfer system with man-ually developed transfer-rules outperformed the version of the system with automatically learned rules This is partly due to the fact that we only attempted to learn rules for NPs and VPs in this experiment We see the cur-rent results as an indication that there is signi_cant room for improving automatic rule learning In particular the learning of uni_cation constraints in our current system requires signi_cant further researchIn summary we feel that we have made signi_cant steps towards the development of a statistically grounded transfer-based MT sys-tem with (1) rules that are scored based on a well-founded probability model and (2) strong and e_ective decoding that incorporates the most advanced techniques used in SMT de-coding Our work complements recent work by other groups on improving translation perfor-mance by incorporating models of syntax into traditional corpus-driven MT methods The fo-cus of our approach however is from the op-posite end of the spectrum we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods We _nd our approach particularly suitable for anguages with very limited data resources

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSFgrant number IIS-0121-631 We would like to thank our team of Hindi-Englishbilingual speakers in Pittsburgh and in India that conducted the data collection forthe research work reported in this paper

REFERENCES

The Brown Corpus httpwwwhituibnoicamebrownbcmhtmlThe Johns Hopkins University Hindi translation webpage httpnlpcsjhuedu hindiMorphology module from IIIT httpwwwiiitnetltrcmorphindexhtmThe Penn Treebank httpwwwcisupennedu~treebankhomehtmlAllen J 1995 Natural Language Understanding Second Edition ed Benjamin CummingsBrown P Cocke J Della Pietra V Della Pietra S Je-linek F Lafferty J MercerR and Roossin P 1990 A statistical approach to Ma-chine Translation ComputationalLinguistics 16 2 7985Brown P Della Pietra V Della Pietra S and Mercer R 1993 The mathematics

of statistical Machine Translation Parameter estimation Computational Linguistics 19 2263311Brown R 1997 Automated dictionary extraction for knowledge-free example-based translationIn International Conference on Theoretical and Method-ological Issues in Machine Translation111118Doddington G 2003 Automatic evaluation of machine translation quality using n-gram cooccurrencestatistics In Proceedings of Human Language TechnologyJones D and Havrilla R 1998 Twisted pair grammar Support for rapid development ofmachine translation for low density languages In Pro-ceedings of the Third Conference of theAssociation for Machine Translation in the Americas (AMTA-98)Leech G 1992 100 million words of english The British National Corpus Language Research28 1 113Nirenburg S 1998 Project Boas A linguist in the box as a multi-purpose language In Proceedingsof the First International Conference on Language Re-sources and Evaluation (LREC-98)Och F J and Ney H Discriminative training and maxi-mum entropy models for statisticalmachine translationPapineni K Roukos S and Ward T 1998 Maximum likelihood and discriminative trainingof direct translation models In Proceedings of the Interna-tional Conference on AcousticsSpeech and Signal Processing (ICASSP-98) 189192Papineni K Roukos S Ward T and Zhu W-J 2002 Bleu a method for automaticevaluation of machine translation In Proceedings of 40th Annual Meeting of the Associationfor Computational Linguistics (ACL) Philadelphia PA 311318Peterson E 2002 Adapting a transfer engine for rapid machine translation development MSthesis Georgetown UniversityProbst K Brown R Carbonell J Lavie A Levin L and Peterson E 2001 Designand implementation of controlled elicitation for machine translation of low-density languagesIn Workshop MT2010 at Machine Translation Summit VIIIProbst K and Levin L 2002 Challenges in automated elicitation of a controlled bilingualcorpus In Theoretical and Methodological Issues in Ma-chine Translation 2002 (TMI-02)Probst K Levin L Peterson E Lavie A and Carbonell J 2003 Mt for resource-poorlanguages using elicitation-based learning of syntactic transfer rules Machine Translation toappearSato S and Nagao M 1990 Towards memory-based translation In COLING-90 247252ACM Transactions on Computational Logic Vol V No N January 2004Experiments with a Hindi-to-English Transfer-based MT System _ 21Sherematyeva S and Nirenburg S 2000 Towards a un-versal tool for NLP resource acquisitionIn Proceedings of the 2nd International Conference on Language Resources andEvaluation (LREC-00)Vogel S and Tribble A 2002 Improving statistical ma-chine translation for a speech-tospeechtranslation task In Proceedings of the 7th International Conferenc on Spoken LanguageProcessing (ICSLP-02)Vogel S Zhang Y Tribble A Huang F Venugopal A Zhao B and Waibel A2003 The cmu statistical translation system In MT Sum-mit IX

Yamada K and Knight K 2001 A syntax-based statisti-cal translation model In Proceedingsof the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01)ACM Transactions on Computational Logic Vol V No N January 2004

3 Conclusions and Future WorkThe ldquofirst-things-firstrdquo approach the AVENUE project

has taken to building NLP systems for scarce-resource languages has proven effective By first focusing effort on producing basic NLP resources the resultant resources are of sufficient quality to be put to any number of uses from building all manner of NLP tools to potentially aiding lin-guists in better understanding an indigenous culture and language For both Mapudungun and Quechua separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid de-velopment Besides a spelling-checker for Mapudungun the AVENUE team has developed computational lexicons morphology analyzers and one or more Machine Transla-tion systems for Mapudungun and Quechua into Spanish

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and enlarging the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems, in a simple and user-friendly way.

3.1 Conclusions and Future Work (from the Hebrew paper)

(Draft note: this section can either be deleted or incorporated into the overall conclusions for this paper.)

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherently high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target-language model that together results in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.

In the conclusion for the paper contrasting Hindi, Hebrew, Mapudungun, and Quechua, we should list the other languages we have worked on and will be working on. Worked on: Dutch. Beginning work on: Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4 Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullán Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. The people who participated in the development of the Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121631.

Acknowledgments (from the Hebrew paper)

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University, and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5 References

Allen, James (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio (2001). Gramática Quechua, Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna, and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert, and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullán (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. MS thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. PhD thesis, Carnegie Mellon University.

Probst, Katharina, and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 Workshop at MT Summit VIII.

References (from the Hebrew paper)

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew–English English–Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley, and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals and Other Measures of Statistical Accuracy. Statistical Science, 1:54-77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113-138.

Wintner, Shuly, and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.


((X2 GEN) = (X1 GEN))
((X2 NUM) = (X1 NUM)))

PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)
(X1::Y2))

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1) (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))

Fig. 7. Some Automatically Learned Rules
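For readability: in these learned rules, each alignment pairs an x-side (source) constituent index with a y-side (target) constituent index. Annotated informally (the comment marks are ours, not part of the rule formalism), the last rule reads:

PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
((X7::Y1)    ; Hindi postposition (7th x-side constituent) becomes the English preposition (1st y-side constituent)
 (X1::Y2) (X2::Y3) (X3::Y4) (X4::Y5) (X5::Y6) (X6::Y7))    ; remaining constituents keep their relative order

This is how a learned rule captures Hindi postpositions surfacing as English prepositions at the front of the phrase.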

4.1.3 Automatically learned grammar

In addition to the manually written grammar, we applied our rule-learning module to the corpus of collected NP and PP phrases and acquired a grammar of automatically inferred transfer rules. The rules were learned as described briefly in Section 2 and in greater detail in [Probst et al. 2003]. The learned grammar consists of a total of 327 rules, which are exclusively NP and PP rules, as inferred from the Penn Treebank elicited data. In a second round of experiments, we assigned probabilities to the rules based on the frequency of the rule (i.e., how many training examples produce a certain rule). We then pruned rules with low probability, resulting in a grammar of a mere 16 rules. The rationale behind pruning rules is that low-probability rules will produce spurious translations most of the time, as they fire in contexts where they should not actually apply. In our experience, rule pruning has very little effect on the translation performance, but great impact on the efficiency of the system. Figure 7 shows some rules that were automatically inferred from the training data. Note that the second rule contains a non-terminal symbol (NP) that was learned by the compositionality module.
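Concretely, the pruning step amounts to keeping only rules whose relative frequency among the training examples clears a threshold. The sketch below is illustrative, assuming each learned rule carries a count of the training examples that produced it; the function name and threshold value are ours, not the actual AVENUE implementation.

from collections import Counter

def prune_rules(rule_counts, min_probability=0.01):
    # Keep only rules whose relative frequency (an estimate of the
    # rule's probability) clears the threshold; values are illustrative.
    total = sum(rule_counts.values())
    return {rule: count / total
            for rule, count in rule_counts.items()
            if count / total >= min_probability}

# Hypothetical learned rules with the number of training examples
# that produced each one (rules abbreviated to strings for brevity).
rule_counts = Counter({
    "NP::NP [ADJ N] -> [ADJ N]": 240,
    "PP::PP [NP POSTP] -> [PREP NP]": 180,
    "PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]": 1,
})
print(prune_rules(rule_counts))   # the singleton rule is pruned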

4.2 Lexical Resources

The transfer engine uses a grammar and a lexicon for translation. The lexicon contains entries from a variety of sources. The most obvious source for lexical translation pairs is the elicited corpus itself. The translation pairs can simply be read off from the alignments that were manually provided by Hindi speakers. Because the alignments did not need to be 1-to-1, the resulting lexical translation pairs can have strings of more than one word on either the Hindi or English side, or both. Another source for lexical entries is the English-Hindi dictionary provided by the Linguistic Data Consortium (LDC). The LDC dictionary contains many (as many as 25) English translations for each Hindi word.

Since some of the English translations are not frequent, or are not frequent translations of the Hindi word, two local Hindi experts cleaned up a portion of this lexicon by editing the list of English translations provided for the Hindi words, leaving only those that were best bets for being reliable all-purpose translations of the Hindi word. The full LDC lexicon was first sorted by Hindi word frequency (estimated from Hindi monolingual text), and the cleanup was performed on the most frequent 12% of the Hindi words in the lexicon. The clean portion of the LDC lexicon was then used for the limited-data experiment. This consisted of 2,725 Hindi words, which corresponded to about 10,000 translation pairs. This effort took about 3 days of manual labor.

The entries in the LDC dictionary are in root forms (both in English and Hindi). In order to be able to produce inflected English forms, such as plural nouns, we enhanced the dictionary with such forms. The enhancement works as follows: for each noun in the dictionary (where the part-of-speech tags are indicated in the original dictionary and cross-checked in the British National Corpus [Leech 1992]), we create an additional entry with the Hindi word unchanged and the English word in the plural form. In addition to the changed English word, we also add a constraint to the lexical rule indicating that the number is plural. A similar strategy was applied to verbs: we added entries (with constraints) for past tense, past participle, gerund (and continuous), as well as future tense. The result is a set of lexical entries associating Hindi root forms with English inflected forms.

How and why do these additional entries work? The transfer engine first runs each Hindi input word through the morphological analyzer. The Morpher returns the root form of the word along with a set of morphological features. The root form is then matched against the lexicon, and the set of lexical transfer rules that match the Hindi root are extracted. The morphological features of the input word are then unified with the feature constraints that appear in each of the candidate lexical transfer rules, pruning out the rules that have inconsistent features. For example, assume we are processing a plural noun in Hindi. The root form of the noun will be used to extract candidate lexical transfer rules from the lexicon, but only the entry that contains a plural form in English will pass unification and succeed.
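A minimal sketch of this unification-based filtering follows, assuming lexical entries are represented as English surface forms with feature constraints; the entry format, feature names, and example Hindi root are our assumptions, not the engine's actual representation.

def unify(input_features, rule_constraints):
    # A candidate lexical rule survives only if none of its feature
    # constraints conflicts with the features the Morpher returned.
    return all(input_features.get(feat) == value
               for feat, value in rule_constraints.items())

# Hypothetical lexical entries for one Hindi root: root form on the
# Hindi side, inflected English forms with number constraints.
candidates = [
    {"english": "book",  "constraints": {"num": "sg"}},
    {"english": "books", "constraints": {"num": "pl"}},
]

morpher_output = {"root": "kitaab", "num": "pl"}   # a plural noun input
surviving = [c for c in candidates
             if unify(morpher_output, c["constraints"])]
print([c["english"] for c in surviving])   # ['books']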

Since we did not have immediate access to an English morphology module, the word inflections were primarily based on spelling rules. Those rules indicate, for instance, when to reduplicate a consonant, when to add 'es' instead of a simple 's' for the plural, etc. For irregular verb inflections we consulted a list of English irregular verbs. With these enhancements, the part of our dictionary derived from the LDC Hindi-English dictionary contains a total of 23,612 translation pairs.

To create an additional resource of high-quality translation pairs, we used monolingual Hindi text to extract the 500 most frequent bigrams. These bigrams were then translated into English by an expert in about 2 days. Some judgement was applied in selecting bigrams that could be translated reliably out of context.
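As an illustration of the spelling rules mentioned above, a toy pluralizer might look like the following; the rule set is deliberately simplified and hypothetical, not the actual rules used.

def pluralize(noun):
    # Toy English pluralizer using spelling rules only (no morphology
    # module); covers just the cases discussed in the text.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"            # add 'es' after sibilants
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"      # city -> cities
    return noun + "s"                 # default: simple 's'

for n in ("book", "box", "city"):
    print(n, "->", pluralize(n))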

Finally, our lexicon contains a number of manually written rules:

- 72 manually written phrase transfer rules: a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently occurring in development-test data early on during the SLE.

- 105 manually written postposition rules: Hindi uses a collection of postposition words that generally correspond with English prepositions. A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions), as observed in development-test data.

- 48 manually written time expression rules: a bilingual speaker manually entered the English translation equivalents of 48 common time expressions, as observed in development-test data.

4.3 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations. The result of a transfer instance on a given input sentence is a lattice, where each entry indicates which Hindi words the partial translation spans and what the potential English translation is. The decoder then rescores these entries and finds the best path through the lattice (for more details see Section 2.4).

Production of the lattice takes place in three passes. Although all possible (partial) translations are output, we have to treat with care the interaction between the morphology, the grammar rules, and the lexicon. The first pass matches the words of the Hindi sentence against the lexicon in their full form, before applying morphology. This is designed especially to allow phrasal lexical entries to fire, where any lexical entry with more than one Hindi word is considered a phrasal entry. In this phase, no grammar rules are applied.

Another pass runs each Hindi input word through the morphology module to obtain the root forms along with any inflectional features of the word. These root forms are then fed into the grammar rules. Note that this phase takes advantage of the enhancements in the lexicon, as described in Section 4.2. Finally, the original Hindi words (in their inflected forms) are matched against the dictionary, producing word-to-word translations without applying grammar rules.

These three passes exhaust the possibilities of matching lexical entries and applying grammar rules in order to maximize the size of the resulting lattice. A bigger lattice is generally preferable, as its entries are rescored and low-probability entries are not chosen by the language model for the final translation.
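The three passes can be sketched as follows, under assumed toy data structures (dictionaries keyed by surface phrase, root form, and surface word, plus an analyze function standing in for the Morpher); this is a schematic reconstruction, not the engine's actual interface.

def build_lattice(words, phrase_lex, root_lex, word_lex, analyze):
    # Toy three-pass lattice construction; entries are (start, end, english).
    lattice = []

    # Pass 1: match surface forms against the lexicon so phrasal
    # entries (more than one Hindi word) can fire; no grammar rules.
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            phrase = " ".join(words[i:j])
            for e in phrase_lex.get(phrase, []):
                lattice.append((i, j, e))

    # Pass 2: root forms plus inflectional features; feature
    # constraints filter the enhanced lexical entries (Section 4.2).
    for i, w in enumerate(words):
        root, feats = analyze(w)
        for english, constraints in root_lex.get(root, []):
            if all(feats.get(k) == v for k, v in constraints.items()):
                lattice.append((i, i + 1, english))

    # Pass 3: inflected words matched word-to-word, no grammar rules.
    for i, w in enumerate(words):
        for e in word_lex.get(w, []):
            lattice.append((i, i + 1, e))

    return lattice   # the decoder later rescores and picks the best path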

5 THE LIMITED DATA SCENARIO EXPERIMENT

5.1 Training and Testing Data

In order to compare the effectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly limited data resources, we artificially crafted a collection of training data for Hindi-to-English translation that intentionally contained extremely limited amounts of parallel text. The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group as well as by the other participating groups in the SLE. The limited data consisted of the following resources:

- Elicited Data Corpus: 17,589 word-aligned phrases and sentences from the elicited data collection described in Section 3. This includes both our translated and aligned controlled elicitation corpus, and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank.

- Small Hindi-to-English Lexicon: 23,612 clean translation pairs from the LDC dictionary (as described in Section 4.2).

- Small Amount of Manually Acquired Resources (as described in Section 4.2): the 500 most common Hindi bigrams, 72 manually written phrase transfer rules, 105 manually written postposition rules, and 48 manually written time expression rules.

The limited data setup includes no additional parallel Hindi-English text. The total amount of bilingual training data was estimated to amount to about 50,000 words.

A small, previously unseen Hindi text was selected as a test set for this experiment. The test set chosen was a section of the data collected at Johns Hopkins University during the later stages of the SLE, using a web-based interface [JHU]. The section chosen consists of 258 sentences, for which four English reference translations are available. An example test sentence can be seen in Figure 8.

Fig. 8. Example of a Test Sentence and Reference Translations

System                 BLEU                NIST
EBMT                   0.058               4.22
SMT                    0.102 (+/- 0.016)   4.70 (+/- 0.20)
Xfer, no grammar       0.109 (+/- 0.015)   5.29 (+/- 0.19)
Xfer, learned grammar  0.112 (+/- 0.016)   5.32 (+/- 0.19)
Xfer, manual grammar   0.135 (+/- 0.018)   5.59 (+/- 0.20)
MEMT (Xfer + SMT)      0.136 (+/- 0.018)   5.65 (+/- 0.21)

Table II. System Performance Results for the Various Translation Approaches

Fig. 9. Results by NIST score: NIST score as a function of the decoder reordering window (0-4), for MEMT (SMT+XFER), XFER-ManGram, XFER-LearnGram, XFER-NoGram, and SMT.

Fig. 10. Results by BLEU score: BLEU score as a function of the decoder reordering window (0-4), for the same five systems.

5.2 Experimental Testing Configuration

The experiment was designed to evaluate and compare several different configurations of the Xfer system and the two corpus-based systems (SMT and EBMT), all trained on the same limited-data resources and tested on the same test set.

The transfer engine was run in a setting that finds all possible translations that are consistent with the transfer rules. The transfer engine produces a complete lattice of all possible partial translations. The lattice is then input into a statistical decoder.

Decoding is performed as described in Section 2.4. We used an English language model that was trained on 70 million English words. As an additional functionality, the decoder can perform a limited reordering of arcs during decoding. At any point in the decoding, arcs that have a start position index of up to four words ahead are considered, and they are allowed to move to the current position in the output if this improves the overall score of the resulting output string according to the English language model.
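A toy version of this window-limited reordering, applied greedily to an already-chosen sequence of arcs with a dummy language-model scorer (the real decoder reorders during lattice search, so this is only an illustration; all names are ours):

def greedy_reorder(arcs, lm_score, window=4):
    # At each step, consider the next `window` arcs and emit whichever
    # one the language model prefers at the current output position.
    arcs = list(arcs)
    output = []
    while arcs:
        best_i, best_s = 0, None
        for i in range(min(window, len(arcs))):
            s = lm_score(output + [arcs[i]])
            if best_s is None or s > best_s:
                best_i, best_s = i, s
        output.append(arcs.pop(best_i))
    return output

PREFERRED = {("the", "house"), ("house", "is"), ("is", "big")}

def lm_score(seq):
    # Dummy scorer: favors a "the"-initial output and known word pairs.
    score = 1 if seq[0] == "the" else 0
    return score + sum((a, b) in PREFERRED for a, b in zip(seq, seq[1:]))

print(greedy_reorder(["house", "the", "is", "big"], lm_score))
# -> ['the', 'house', 'is', 'big']: "the" was pulled forward one arc.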

The following systems were evaluated in the experiment:

(1) The following versions of the Xfer system:
(a) Xfer with No Grammar: Xfer with no syntactic transfer rules (i.e., only phrase-to-phrase matches and word-to-word lexical transfer rules, with and without morphology).
(b) Xfer with Learned Grammar: Xfer with automatically learned syntactic transfer rules, as described in Section 4.1.3.
(c) Xfer with Manual Grammar: Xfer with the manually developed syntactic transfer rules, as described in Section 4.1.2.

(2) SMT: The CMU Statistical MT (SMT) system [Vogel et al. 2003], trained on the limited-data parallel text resources.

(3) EBMT: The CMU Example-based MT (EBMT) system [Brown 1997], trained on the limited-data parallel text resources.

(4) MEMT: A "multi-engine" version that combines the lattices produced by the SMT system and the Xfer system with manual grammar. The decoder then selects an output from the joint lattice.

5.3 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the BLEU score [Papineni et al. 2002]. In order to validate the statistical significance of the differences in NIST and BLEU scores, we applied a commonly used sampling technique over the test set: we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero, one, or more times in the newly drawn set). We then calculate scores for all systems on the randomly drawn set (rather than the original set). This process was repeated 10,000 times. Median scores and 95% confidence intervals were calculated based on the set of scores.
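This bootstrap resampling procedure is straightforward to sketch. Note that NIST and BLEU are corpus-level metrics, so the stand-in scorer below (a mean of per-sentence scores) is only illustrative, and the function name and toy data are ours.

import random

def bootstrap_ci(sentence_scores, n_draws=10000, seed=0):
    # Resample the test set with replacement n_draws times, score each
    # resample, and read the median and 95% interval off the sorted list.
    rng = random.Random(seed)
    n = len(sentence_scores)
    samples = []
    for _ in range(n_draws):
        draw = [sentence_scores[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(draw) / n)   # stand-in for a corpus-level metric
    samples.sort()
    return (samples[n_draws // 2],           # median
            samples[int(0.025 * n_draws)],   # lower 95% bound
            samples[int(0.975 * n_draws)])   # upper 95% bound

median, lower, upper = bootstrap_ci([0.10, 0.20, 0.15, 0.12] * 64)
print(median, lower, upper)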

Table II compares the results of the different systems (with the decoder's best reordering window), along with the 95% confidence intervals. Figures 9 and 10 show the effect of different reordering windows in the decoder for the different systems. For clarity, confidence intervals are graphically shown only for the NIST scores (not for BLEU), and only for SMT, Xfer with no grammar, and MEMT.

5.4 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data "cross-over" point: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test set without morphology is about 70%, whereas with morphology token coverage increases to around 79%. We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences for the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar.

Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is not statistically very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for reordering by the decoder. On the other hand, the results indicate that some of the "burden" of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and PPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated, rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html

Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.

Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.

Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.

Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.

Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).

Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.

Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).

Och, F. J. and Ney, H. Discriminative training and maximum entropy models for statistical machine translation.

Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.

Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. MS thesis, Georgetown University.

Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.

Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).

Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.

Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.

Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).

Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).

Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.

Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01).

3 Conclusions and Future WorkThe ldquofirst-things-firstrdquo approach the AVENUE project

has taken to building NLP systems for scarce-resource languages has proven effective By first focusing effort on producing basic NLP resources the resultant resources are of sufficient quality to be put to any number of uses from building all manner of NLP tools to potentially aiding lin-guists in better understanding an indigenous culture and language For both Mapudungun and Quechua separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid de-velopment Besides a spelling-checker for Mapudungun the AVENUE team has developed computational lexicons morphology analyzers and one or more Machine Transla-tion systems for Mapudungun and Quechua into Spanish

The AVENUE team has recently put many of the re-sources we have developed for Mapudungun online at httpwwwlenguasamerindiasorgmapudungun The AVENUE interactive website which is still in an experi-mental phase contains a basic Mapudungun-Spanish lexi-con the Mapudungun morphological analyzer and the ex-ample-based MT system from Mapudungun to Spanish

The AVENUE team continues to develop the resources for both Mapudungun and Quechua We are actively working on improving the Mapudungun-Spanish rule-based MT system by both increasing the size of the lexi-con as well as improving the rules themselves The Ma-pudungun-Spanish example-based MT system can be im-proved by cleaning and increasing the size of the training text For the next version of the MT website we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feed-back about the correctness of the automatic translation produced by our systems in a simple and user-friendly way

31 Conclusions and Future WorkOur current Hebrew-to-English system was developed

over the course of a two-month period with a total labor-effort equivalent to about four person-months of develop-ment Unique problems of the Hebrew language particu-larly the inherent high-levels of ambiguity in morphology and orthography were addressed The bilingual dictionary and the morphological analyzer that were available to us had serious limitations but were adapted in novel ways to support the MT task The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language This provides some encouraging validation to the general suitability of this framework to rapid MT pro-totyping for languages with limited linguistic resources

The results of our rapid-development effort exceeded our expectations Evaluation results on an unseen test-set and an examination of actual translations of development data indicate that the current system is already effective

enough to produce comprehensible translations in many cases This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures We believe that it is the combi-nation of this very simple transfer grammar with the selec-tional disambiguation power provided by the English tar-get language model that together result in surprisingly ef-fective translation capabilities

We plan to continue the development of the Hebrew-to-English system over the next year Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological an-alyzer Several advanced issues that were not addressed in the current system will be investigated We plan to signifi-cantly enhance the sources of information used by the de-coder in disambiguating and selecting among translation segments These include (1) a language model for the source language trained from a monolingual Hebrew cor-pus which can help disambiguate the morphological read-ings of input words (2) scores for the transfer rules based on a scoring model currently under development and (3) a probabilistic version of the translation lexicon which we hope to train once we collect some amount of parallel He-brew-English data We expect dramatic further improve-ments in translation quality once these issues are properly addressed

In the conclusion for the paper contrasting Hindi He-brew Mapu and Quechua we should list the other lan-guages we have worked on and will be working on

Worked on DutchBeginning work on Inupiaq Brazillian Portugese lan-

guages native to Brail Aymara

4 AcknowledgementsMany people beyond the authors of this paper contrib-

uted to the work described The Chilean AVENUE team in-cludes Carolina Huenchullan Arruacutee Claudio Millacura Salas Eliseo Cantildeulef Rosendo Huisca Hugo Carrasco Hector Painequeo Flor Caniupil Luis Caniupil Huaiqui-ntildeir Marcela Collio Calfunao Cristian Carrillan Anton and Salvador Cantildeulef Rodolfo Vega was the liaison be-tween the CMU team and the government agency and in-digenous community in Chile People that participated in the development of Quechua data are Marilyn Feke Sa-lomeacute Gutierrez Irene Goacutemez and Yenny Ccolque

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631 We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer

5 ReferencesAllen James (1995) Natural Language Understanding

Second Edition ed Benjamin Cummings

Carnegie Mellon University 271206
Acknowledgements from the Hebrew paper
Carnegie Mellon University 271206
Conclusions from the Hebrew paper that can either be deleted or incorporated into the overall conclusions for this paper

Brown Ralf D Rebecca Hutchinson Paul NBennett Jaime G Carbonell and Peter Jansen (2003) Reduc-ing Boundary Friction Using Translation-Fragment Overlap in Proceedings of the Ninth Machine Transla-tion Summit New Orleans USA pp 24-31

Brown Ralf D (2000) Example-Based Machine Trans-lation at Carnegie Mellon University In The ELRA Newsletter European Language Resources Association vol 51 January-March 2000

Cusihuaman Antonio (2001) Gramatica Quechua Cuzco Callao 2a edicioacuten Centro Bartolomeacute de las Casas

Font Llitjoacutes Ariadna Jaime Carbonell Alon Lavie (2005a) A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation Eu-ropean Association of Machine Translation (EAMT) 10th Annual Conference Budapest Hungary

Font Llitjoacutes Ariadna Roberto Aranovich and Lori Levin (2005b) Building Machine translation systems for in-digenous languages Second Conference on the Indige-nous Languages of Latin America (CILLA II) Texas USA

Font Llitjoacutes Ariadna and Jaime Carbonell (2004) The Translation Correction Tool English-Spanish user stud-ies International Conference on Language Resources and Evaluation (LREC) Lisbon Portugal

Frederking Robert and Sergei Nirenburg (1994) Three Heads are Better than One Proceedings of the fourth Conference on Applied Natural Language Processing (ANLP-94) pp 95-100 Stuttgart Germany

Lavie A S Wintner Y Eytani E Peterson and K Probst (2004) Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004) Baltimore MD October 2004 Pages 1-10

Lavie Alon Stephan Vogel Lori Levin Erik Peterson Katharina Probst Ariadna Font Llitjoacutes Rachel Reynolds Jaime Carbonell and Richard Cohen (2003) Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario ACM Transactions on Asian Language Information Process-ing (TALIP) 2(2)

Levin Lori Alison Alvarez Jeff Good and Robert Fred-erking (In Press) Automatic Learning of Grammatical Encoding To appear in Jane Grimshaw Joan Maling Chris Manning Joan Simpson and Annie Zaenen (eds) Architectures Rules and Preferences A Festschrift for Joan Bresnan CSLI Publications

Levin Lori Rodolfo Vega Jaime Carbonell Ralf Brown Alon Lavie Eliseo Cantildeulef and Carolina Huenchullan (2000) Data Collection and Language Technologies for Mapudungun International Conference on Language Resources and Evaluation (LREC)

Mitchell Marcus A Taylor R MacIntyre A Bies C Cooper M Ferguson and A Littmann (1992) The Penn Treebank Project httpwwwcisupennedu tree-bankhomehtml

Monson Christian Lori Levin Rodolfo Vega Ralf Brown Ariadna Font Llitjoacutes Alon Lavie Jaime Car-bonell Eliseo Cantildeulef and Rosendo Huesca (2004) Data Collection and Analysis of Mapudungun Morphol-ogy for Spelling Correction International Conference on Language Resources and Evaluation (LREC)

Papineni Kishore Salim Roukos Todd Ward and Wei-Jing Zhu (2001) BLEU A Method for Automatic Evaluation of Machine Translation IBM Research Re-port RC22176 (W0109-022)

Peterson Erik (2002) Adapting a transfer engine for rapid machine translation development MS thesis Georgetown University

Probst Katharina (2005) Automatically Induced Syntac-tic Transfer Rules for Machine Translation under a Very Limited Data Scenario PhD Thesis Carnegie Mellon

Probst Katharina and Alon Lavie (2004) A structurally diverse minimal corpus for eliciting structural mappings between languages Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04)

Probst Katharina Ralf Brown Jaime Carbonell Alon Lavie Lori Levin and Erik Peterson (2001) Design and Implementation of Controlled Elicitation for Ma-chine Translation of Low-density Languages Proceed-ings of the MT2010 workshop at MT Summit

References

Carbonell Jaime Katharina Probst Erik Peterson Christian Monson Alon Lavie Ralf Brown and

Lori Levin 2002 Automatic Rule Learning for Re-source-Limited MT In Proceedings of the

5th Biennial Conference of the Association for Ma-chine Translation in the Americas (AMTA-02)

OctoberDahan Hiya 1997 HebrewndashEnglish EnglishndashHebrew

Dictionary Academon JerusalemDoddington George 2002 Automatic Evaluation of

Machine Translation Quality Using N-gram Co-Occurrence Statistics In Proceedings of the Second

Conference on Human Language Technology(HLT-2002)Efron Bradley and Robert Tibshirani 1986 Bootstrap

Methods for Standard Errors Condence Intervalsand Other Measures of Statistical Accuracy Statistical

Science 154ndash77Lavie Alon Stephan Vogel Lori Levin Erik Peter-

son Katharina Probst Ariadna Font Llitjos RachelReynolds Jaime Carbonell and Richard Cohen 2003

Experiments with a Hindi-to-EnglishTransfer-based MT System under a Miserly Data Sce-

nario Transactions on Asian Language InformationProcessing (TALIP) 2(2) JunePapineni Kishore Salim Roukos ToddWard

andWei-Jing Zhu 2002 BLEU a Method for AutomaticEvaluation of Machine Translation In Proceedings of

40th Annual Meeting of the Association forComputational Linguistics (ACL) pages 311ndash318

Philadelphia PA JulyPeterson Erik 2002 Adapting a Transfer Engine for

Rapid Machine Translation Development Mastersthesis Georgetown UniversityProbst Katharina Lori Levin Erik Peterson Alon

Lavie and Jaime Carbonell 2002 MT for Resource-Poor Languages Using Elicitation-Based Learning of

Syntactic Transfer Rules Machine Translation17(4)

Segal Erel 1999 Hebrew Morphological Analyzer for Hebrew Undotted Texts Masters thesis

Technion Israel Institute of Technology Haifa Octo-ber In Hebrew

Wintner Shuly 2004 Hebrew Computational Lin-guistics Past and Future Articial Intelligence

Review 21(2)113ndash138Wintner Shuly and Shlomo Yona 2003 Resources

for Processing Hebrew In Proceedings of theMT-Summit IX workshop on Machine Translation for

Semitic Languages New Orleans September

Carnegie Mellon University 271206
Refereces from the Hebrew paper
  • Omnivorous MT Building NLP Systems for Multiple Languages
  • Ariadna Font Llitjoacutes1 Christian Monson1 Roberto Aranovich2 Lori Levin1 Ralf Brown1 Katharina Probst3 Erik Peterson1 Jaime Carbonell1 Alon Lavie1
  • 1 Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • 2 Department of Linguistics
  • University of Pittsburgh
  • 3 Wherever Katharin is
Page 27: [Título del trabajo --14] - Carnegie Mellon School of · Web viewAt the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL

lish irregular verbs With these enhance-ments the part of our dictionary derived from the LDCHindi-English dictionary contains a total of 23612 translation pairsTo create an additional resource for high-qual-ity translation pairs we used monolingual Hindi text to extract the 500 most frequent bi-grams These bigrams were then translated into English by an expert in about 2 days Some judement was applied in selecting bi-grams that could be translated reliably out of contextFinally our lexicon contains a number of man-ually written rules

-72 manually written phrase transfer rules a bilingual speaker manually entered English translation equivalents for 72 common Hindi phrases that were observed as frequently oc-curring in development-test data early on dur-ing the SLE

-105 manually written postposition rules Hindi uses a collection of post-position words that generally correspond with English prepo-sitions A bilingual speaker manually entered the English translation equivalents of 105 common postposition phrases (combinations of nouns and the following postpositions) as observed in development-test data

-48 manually written time expression rules a bilingual speaker manually entered the Eng-lish translation equivalents of 48 common time expressions as observed in develop-ment-test data

43 Runtime Configuration

The transfer engine is run in a mode where it outputs all possible (partial) translations The result of a transfer instance on a given input sentence is a lattice where each entry indi-cates which Hindi words the partial translation spans and what the potential English transla-tion is The decoder then rescores these en-tries and _nds the best path through the lat-tice (for more details see Section 24)Production of the lattice takes place in three passes Although all possible (partial) transla-tions are output we have to treat with care the interaction between the morphology the grammar rules and the lexicon The _rst pass matches the Hindi sentence against the lexi-con in their full form before applying mor-phology This is designed especially to allow phrasal lexical entries to _re where any lexi-cal entry with more than one Hindi word is considered a phrasal entry In this phase no grammar rules are applied

Another pass runs each Hindi input word through the morphology module to obtain the root forms along with any inflectional features of the word These rootforms are then fed into the grammar rules Note that this phase takes advantage of the enhancements in the lexicon as described in Section 42Finally the original Hindi words (in their in-flected forms) are matched against the dictio-nary producing word-to-word translations without applying grammarrulesThese three passes exhaust the possibilities of matching lexical entries and applying gram-mar rules in order to maximize the size of the resulting lattice A bigger lattice is generally preferable as its entries are rescored and low-probability entries are not chosen by the lan-guage model for the _nal translation

5 THE LIMITED DATA SCENARIO EXPERIMENT

51 Training and Testing Data

In order to compare the e_ectiveness of the Xfer system with data-driven MT approaches (SMT and EBMT) under a scenario of truly lim-ited data resources we arti_cially crafted a collection of training data for Hindi-to-English translation that intentionally contained ex-tremely limited amounts of parallel text The training data was extracted from the larger pool of resources that had been collected throughout the SLE by our own group as well as the other participating groups in the SLE The limited data consisted of the following re-sources

-Elicited Data Corpus 17589 word-aligned phrases and sentences from the elicited data collection described in section 3 This includes both our translated and aligned controlled elicitation corpus and also the translated and aligned uncontrolled corpus of noun phrases and prepositional phrases extracted from the Penn Treebank

-Small Hindi-to-English Lexicon 23612 clean translation pairs from the LDC dictionary (as described in Section 42)

-Small Amount of Manually Acquired Re-sources (As described in Section 42) | 500 most common Hindi bigrams 72 manually written phrase transfer rules 105 manually written postposition rules and 48 manually written time expression rules

The limited data setup includes no additional parallel Hindi-English text The total amount of

bilingual training data was estimated to amount to about 50000 wordsA small previously unseen Hindi text was se-lected as a test-set for this experiment The test-set chosen was a section of the data col-lected at Johns Hopkins University during the later stages of the SLE using a web-based in-terface [JHU] The section chosen consists of 258 sentences for which four English refer-ence translations are available An example test sentence can be seen in Figure 8Fig 8 Example of a Test Sentence and Reference Trans-lations

System BLEU NISTEBMT 0058 422SMT 0102 (+- 0016) 470 (+- 020)Xfer no grammar 0109 (+- 0015) 529 (+- 019)Xfer learned grammar 0112 (+- 0016) 532 (+- 019)Xfer manual grammar 0135 (+- 0018) 559 (+- 020)MEMT (Xfer +SMT) 0136 (+- 0018) 565 (+- 021)Table II System Performance Results for the Various Translation Approaches

NIST scores44244464855254565860 1 2 3 4Reordering WindowNIST ScoreMEMT(SMT+XFER)XFER-ManGramXFER-LearnGramXFER-NoGramSMT

Fig 9 Results by NIST score

BLEU scores007008009010110120130140150 1 2 3 4Reordering WindowBLEU ScoreMEMT(SMT+XFER)XFER-ManGramXFER-LearnGramXFER-NoGramSMT

Fig 10 Results by BLEU score

52 Experimental Testing Con_guration

The experiment was designed to evaluate and compare several di_erent con_gurations of the Xfer system and the two corpus-based sys-tems (SMT and EBMT) all trained on the same limited-data resources and tested on the same test setThe transfer engine was run in a setting that _nds all possible translations that are consis-tent with the transfer rules The transfer en-gine produces a complete lattice of all possi-ble partial translations The lattice is then in-put into a statistical decoder Decoding is per-

formed as described in Section 24 We used an English language model that was trained on 70 million English words As an additional functionality the decoder can perform a lim-ited reordering of arcs during decoding At any point in the decoding arcs that have a start position index of up to four words ahead are considered and allowed to be moved to the current position in the output if this im-proves the overall score of the resulting out-put string according to the English language modelThe following systems were evaluated in the experiment

(1) The following versions of the Xfer system

(a) Xfer with No Grammar Xfer with no syn-tactic transfer rules (ie only phrase-to-phrase matches and word-to-word lexical transfer rules with and without morphology)(b) Xfer with Learned Grammar Xfer with au-tomatically learned syntactic transfer rules as described in section 413(c) Xfer with Manual Grammar Xfer with the manually developed syntactic transfer rules as described in section 412

(2) SMT The CMU Statistical MT (SMT) system [Vogel et al 2003] trained on the limited-data parallel text resources

(3) EBMT The CMU Example-based MT (EBMT) system [Brown 1997] trained on the limited-data parallel text resources

(4) MEMT A ldquomulti-engine version that com-bines the lattices produced by the SMT sys-tem and the Xfer system with manual gram-mar The decoder then selects an output from the joint lattice

53 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the Bleu score [Papineni et al 2002] In order to validate the statistical signi_cance of the di_erences in NIST and Bleu scores we applied a commonly used sampling technique over the test set we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero once or more in the newly drawn set) We then calculate scores for all systems on the randomly drawn set (rather than the original set) This process was re-peated 10000 times Median scores and 95 con_dence intervals were calculated based on the set of scores

Table II compares the results of the di_erent systems (with the decoders best reordering window) along with the 95 con_dence inter-vals Figures 9 and 10 show the e_ect of di_er-ent systems with di_erent reordering windows in the decoder For clarity con_dence inter-vals are graphically shown only for the NIST scores (not for Bleu) and only for SMT Xfer with no grammar and MEMT

54 Discussion of Results

The results of the experiment clearly show that, under the specific miserly data training scenario that we constructed, the Xfer system, with all its variants, significantly outperformed the SMT system. While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach, we see these results as a clear validation of the utility and effectiveness of our transfer approach in other scenarios where only very limited amounts of parallel text and other online resources are available. In earlier experiments during the SLE, we observed that SMT outperformed the Xfer system when much larger amounts of parallel text data were used for system training. This indicates that there exists a data “cross-over” point between the performance of the two systems: given more and more data, SMT will outperform the current Xfer system. Part of future work will be to first determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given, thus making the Xfer system applicable to a wider variety of conditions.

The use of morphology within the Xfer system was also a significant factor in the gap in performance between the Xfer system and the SMT system in the experiment. Token word coverage of the test-set without morphology is about 70%, whereas with morphology token coverage increases to around 79% (a minimal sketch of this coverage computation, under assumed data structures, follows at the end of this discussion). We acknowledge that the availability of a high-coverage morphological analyzer for Hindi worked in our favor, and a morphological analyzer of such quality may not be available for many other languages. Our Xfer approach, however, can function even with partial morphological information, with some consequences on the effectiveness and generality of the rules learned.

The results of the comparison between the various versions of the Xfer system also show interesting trends, although the statistical significance of some of the differences is not very high. Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical significance) Xfer with no grammar and Xfer with automatically learned grammar. Xfer with automatically learned grammar is slightly better than Xfer with no grammar, but the difference is statistically not very significant. We take these results to be highly encouraging, since both the manually written and automatically learned grammars were very limited in this experiment. The automatically learned rules only covered NPs and PPs, whereas the manually developed grammar mostly covers verb constructions. While our main objective is to infer rules that perform comparably to hand-written rules, it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar system, indicating that there is much room for improvement. If the learning algorithms are improved, the performance of the overall system can also be improved significantly.

The significant effects of decoder reordering are also quite interesting. On one hand, we believe this indicates that various more sophisticated rules could be learned, and that such rules could better order the English output, thus reducing the need for re-ordering by the decoder. On the other hand, the results indicate that some of the “burden” of reordering can remain within the decoder, thus possibly compensating for weaknesses in rule learning.

Finally, we were pleased to see that the consistently best performing system was our multi-engine configuration, where we combined the translation hypotheses of the SMT and Xfer systems together into a common lattice and applied the decoder to select a final translation. The MEMT configuration outperformed the best purely-Xfer system with reasonable statistical confidence. Obtaining a multi-engine combination scheme that consistently outperforms all the individual MT engines has been notoriously difficult in past research. While the results we obtained here are for a unique data scenario, we hope that the framework applied here for multi-engine integration will prove to be effective for a variety of other scenarios as well. The inherent differences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios.
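
A minimal sketch of the token-coverage computation referenced above, under assumed data structures: lexicon is a set of known source forms or stems, and analyze is a hypothetical morphological analyzer returning candidate stems for a surface token; neither reflects the project's actual resources.

```python
def token_coverage(test_tokens, lexicon, analyze=None):
    """Fraction of test-set tokens that have some lexical translation.
    Without `analyze`, only exact lexicon hits count (the ~70% figure
    above); with it, inflected forms whose stems are in the lexicon also
    count (the ~79% figure)."""
    def covered(tok):
        if tok in lexicon:
            return True
        return analyze is not None and any(s in lexicon for s in analyze(tok))
    return sum(covered(t) for t in test_tokens) / len(test_tokens)
```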

6. CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to develop and test an open-domain, large-scale Hindi-to-English version of our Xfer system. This experience was extremely helpful for enhancing the basic capabilities of our system. The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very effective in boosting the overall performance of our Xfer system.

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and VPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system with (1) rules that are scored based on a well-founded probability model, and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631. We would like to thank our team of Hindi-English bilingual speakers in Pittsburgh and in India who conducted the data collection for the research work reported in this paper.

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html

The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi

Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm

The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html

Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79-85.

Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263-311.

Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111-118.

Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.

Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).

Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1-13.

Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).

Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189-192.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311-318.

Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.

Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.

Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).

Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.

Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247-252.

Sheremetyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).

Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).

Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.

Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The “first-things-first” approach the AVENUE project has taken to building NLP systems for scarce-resource languages has proven effective. By first focusing effort on producing basic NLP resources, the resultant resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphology analyzers, and one or more Machine Translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherent high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework to rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test-set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target-language model that together results in surprisingly effective translation capabilities.

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.
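
As an illustration of item (1), choosing among alternative morphological readings with a source-side language model might look like the sketch below; token_analyses and src_lm_logprob are assumed inputs (one list of candidate analyses per input token, and a monolingual LM scorer such as the bigram sketch earlier), not components of the current system.

```python
import itertools

def best_reading(token_analyses, src_lm_logprob):
    """Choose among alternative morphological readings of an input sentence
    using a monolingual source-side LM. The cartesian product enumerates
    full readings (fine for short inputs, exponential in general)."""
    readings = itertools.product(*token_analyses)
    return max((" ".join(r) for r in readings), key=src_lm_logprob)
```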

In addition to Hindi, Hebrew, Mapudungun, and Quechua, we have also worked on Dutch, and we are beginning work on Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrué, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People who participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.


Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio (2001). Gramática Quechua: Cuzco Callao, 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association of Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna, and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert, and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pp. 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina, and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 Workshop at MT Summit VIII.

  • Omnivorous MT Building NLP Systems for Multiple Languages
  • Ariadna Font Llitjoacutes1 Christian Monson1 Roberto Aranovich2 Lori Levin1 Ralf Brown1 Katharina Probst3 Erik Peterson1 Jaime Carbonell1 Alon Lavie1
  • 1 Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • 2 Department of Linguistics
  • University of Pittsburgh
  • 3 Wherever Katharin is
Page 28: [Título del trabajo --14] - Carnegie Mellon School of · Web viewAt the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL

bilingual training data was estimated to amount to about 50000 wordsA small previously unseen Hindi text was se-lected as a test-set for this experiment The test-set chosen was a section of the data col-lected at Johns Hopkins University during the later stages of the SLE using a web-based in-terface [JHU] The section chosen consists of 258 sentences for which four English refer-ence translations are available An example test sentence can be seen in Figure 8Fig 8 Example of a Test Sentence and Reference Trans-lations

System BLEU NISTEBMT 0058 422SMT 0102 (+- 0016) 470 (+- 020)Xfer no grammar 0109 (+- 0015) 529 (+- 019)Xfer learned grammar 0112 (+- 0016) 532 (+- 019)Xfer manual grammar 0135 (+- 0018) 559 (+- 020)MEMT (Xfer +SMT) 0136 (+- 0018) 565 (+- 021)Table II System Performance Results for the Various Translation Approaches

NIST scores44244464855254565860 1 2 3 4Reordering WindowNIST ScoreMEMT(SMT+XFER)XFER-ManGramXFER-LearnGramXFER-NoGramSMT

Fig 9 Results by NIST score

BLEU scores007008009010110120130140150 1 2 3 4Reordering WindowBLEU ScoreMEMT(SMT+XFER)XFER-ManGramXFER-LearnGramXFER-NoGramSMT

Fig 10 Results by BLEU score

52 Experimental Testing Con_guration

The experiment was designed to evaluate and compare several di_erent con_gurations of the Xfer system and the two corpus-based sys-tems (SMT and EBMT) all trained on the same limited-data resources and tested on the same test setThe transfer engine was run in a setting that _nds all possible translations that are consis-tent with the transfer rules The transfer en-gine produces a complete lattice of all possi-ble partial translations The lattice is then in-put into a statistical decoder Decoding is per-

formed as described in Section 24 We used an English language model that was trained on 70 million English words As an additional functionality the decoder can perform a lim-ited reordering of arcs during decoding At any point in the decoding arcs that have a start position index of up to four words ahead are considered and allowed to be moved to the current position in the output if this im-proves the overall score of the resulting out-put string according to the English language modelThe following systems were evaluated in the experiment

(1) The following versions of the Xfer system

(a) Xfer with No Grammar Xfer with no syn-tactic transfer rules (ie only phrase-to-phrase matches and word-to-word lexical transfer rules with and without morphology)(b) Xfer with Learned Grammar Xfer with au-tomatically learned syntactic transfer rules as described in section 413(c) Xfer with Manual Grammar Xfer with the manually developed syntactic transfer rules as described in section 412

(2) SMT The CMU Statistical MT (SMT) system [Vogel et al 2003] trained on the limited-data parallel text resources

(3) EBMT The CMU Example-based MT (EBMT) system [Brown 1997] trained on the limited-data parallel text resources

(4) MEMT A ldquomulti-engine version that com-bines the lattices produced by the SMT sys-tem and the Xfer system with manual gram-mar The decoder then selects an output from the joint lattice

53 Experimental Results

Performance of the systems was measured using the NIST scoring metric [Doddington 2003] as well as the Bleu score [Papineni et al 2002] In order to validate the statistical signi_cance of the di_erences in NIST and Bleu scores we applied a commonly used sampling technique over the test set we randomly draw 258 sentences independently from the set of 258 test sentences (thus sentences can appear zero once or more in the newly drawn set) We then calculate scores for all systems on the randomly drawn set (rather than the original set) This process was re-peated 10000 times Median scores and 95 con_dence intervals were calculated based on the set of scores

Table II compares the results of the di_erent systems (with the decoders best reordering window) along with the 95 con_dence inter-vals Figures 9 and 10 show the e_ect of di_er-ent systems with di_erent reordering windows in the decoder For clarity con_dence inter-vals are graphically shown only for the NIST scores (not for Bleu) and only for SMT Xfer with no grammar and MEMT

54 Discussion of Results

The results of the experiment clearly show that under the speci_c miserly data training scenario that we constructed the Xfer sys-tem with all its variants signi_cantly outper-formed the SMT system While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach we see these results as a clear validation of the utility and e_ectiveness of our transfer ap-proach in other scenarios where only very lim-ited amounts of parallel text and other online resources are available In earlier experiments during the SLE we observed that SMT outper-formed the Xfer system when much larger amounts of parallel text data were used for system training This indicates that there ex-ists a data ldquocross-over point between the performance of the two systems given more and more data SMT will outperform the cur-rent Xfer system Part of future work will be to _rst determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given thus making the Xfer system applicable to a wider variety of conditionsThe use of morphology within the Xfer system was also a signi_cant factor in the gap in per-formance between the Xfer system and the SMT system in theexperiment Token word coverage of the test-set without morphology is about 70 whereas with morphology token coverage in-creases to around 79 We acknowledge that the availability of a high-coverage morphologi-cal analyzer for Hindi worked to our favor and a morphological analyzer of such quality may not be available for many other languages Our Xfer approach however can function even with partial morphological information with some consequences on the e_ectiveness and generality of the rules learnedThe results of the comparison between the various versions of the Xfer system also show interesting trends although the statistical signi_cance of some of the di_erences is not very high Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical signi_cance) Xfer with no grammar and Xfer with automatically learned

grammar Xfer with automatically learned grammar is slightly better than Xfer with no grammar but the di_erence is statistically not very signi_cant We take these results to be highly encouraging since both the manually written and automatically learned grammars were very limited in this experiment The au-tomatically learned rules only covered NPs and PPs whereas the manually developed grammar mostly covers verb constructions While our main objective is to infer rules that perform comparably tohand-written rules it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar sys-tem indicating that there is much room for improvement If the learning algorithms are improved the performance of the overall sys-tem can also be improved signi_cantlyThe signi_cant e_ects of decoder reordering are also quite interesting On one hand we believe this indicates that various more so-phisticated rules could be learned and that such rules could better order the English out-put thus reducing the need for re-ordering by the decoder On the other hand the results in-dicate that some of the ldquoburden of reordering can remain within the decoder thus possibly compensating for weaknesses in rule learningFinally we were pleased to see that the con-sistently best performing system was our multi-engine con_guration where we com-bined the translation hypotheses of the SMT and Xfer systems together into a common lat-tice and applied the decoder to select a _nal translation The MEMT con_guration outper-formed the best purely Xfer system with rea-sonable statistical con_dence Obtaining a multiengine combination scheme that consis-tently outperforms all the individual MT en-gines has been notoriously di_cult in past re-search While the results we obtained here are for a unique data scenario we hope that the framework applied here for multi-engine inte-gration will prove to be e_ective for a variety of other scenarios as well The inherent di_er-ences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to de-velop and test an open-domain largescale Hindi-to-English version of our Xfer system This experience was extremely helpful for en-hancing the basic capabilities of our system The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very e_ective in boosting the overall performance of our Xfer system

The experiments we conducted under the ex-tremely limited Hindi data resources scenario were very insightful The results of our experi-ments indicate that our Xfer system in its cur-rent state outperforms SMT and EBMT when the amount of available parallel training text is extremely small The Xfer system with man-ually developed transfer-rules outperformed the version of the system with automatically learned rules This is partly due to the fact that we only attempted to learn rules for NPs and VPs in this experiment We see the cur-rent results as an indication that there is signi_cant room for improving automatic rule learning In particular the learning of uni_cation constraints in our current system requires signi_cant further researchIn summary we feel that we have made signi_cant steps towards the development of a statistically grounded transfer-based MT sys-tem with (1) rules that are scored based on a well-founded probability model and (2) strong and e_ective decoding that incorporates the most advanced techniques used in SMT de-coding Our work complements recent work by other groups on improving translation perfor-mance by incorporating models of syntax into traditional corpus-driven MT methods The fo-cus of our approach however is from the op-posite end of the spectrum we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods We _nd our approach particularly suitable for anguages with very limited data resources

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSFgrant number IIS-0121-631 We would like to thank our team of Hindi-Englishbilingual speakers in Pittsburgh and in India that conducted the data collection forthe research work reported in this paper

REFERENCES

The Brown Corpus httpwwwhituibnoicamebrownbcmhtmlThe Johns Hopkins University Hindi translation webpage httpnlpcsjhuedu hindiMorphology module from IIIT httpwwwiiitnetltrcmorphindexhtmThe Penn Treebank httpwwwcisupennedu~treebankhomehtmlAllen J 1995 Natural Language Understanding Second Edition ed Benjamin CummingsBrown P Cocke J Della Pietra V Della Pietra S Je-linek F Lafferty J MercerR and Roossin P 1990 A statistical approach to Ma-chine Translation ComputationalLinguistics 16 2 7985Brown P Della Pietra V Della Pietra S and Mercer R 1993 The mathematics

of statistical Machine Translation Parameter estimation Computational Linguistics 19 2263311Brown R 1997 Automated dictionary extraction for knowledge-free example-based translationIn International Conference on Theoretical and Method-ological Issues in Machine Translation111118Doddington G 2003 Automatic evaluation of machine translation quality using n-gram cooccurrencestatistics In Proceedings of Human Language TechnologyJones D and Havrilla R 1998 Twisted pair grammar Support for rapid development ofmachine translation for low density languages In Pro-ceedings of the Third Conference of theAssociation for Machine Translation in the Americas (AMTA-98)Leech G 1992 100 million words of english The British National Corpus Language Research28 1 113Nirenburg S 1998 Project Boas A linguist in the box as a multi-purpose language In Proceedingsof the First International Conference on Language Re-sources and Evaluation (LREC-98)Och F J and Ney H Discriminative training and maxi-mum entropy models for statisticalmachine translationPapineni K Roukos S and Ward T 1998 Maximum likelihood and discriminative trainingof direct translation models In Proceedings of the Interna-tional Conference on AcousticsSpeech and Signal Processing (ICASSP-98) 189192Papineni K Roukos S Ward T and Zhu W-J 2002 Bleu a method for automaticevaluation of machine translation In Proceedings of 40th Annual Meeting of the Associationfor Computational Linguistics (ACL) Philadelphia PA 311318Peterson E 2002 Adapting a transfer engine for rapid machine translation development MSthesis Georgetown UniversityProbst K Brown R Carbonell J Lavie A Levin L and Peterson E 2001 Designand implementation of controlled elicitation for machine translation of low-density languagesIn Workshop MT2010 at Machine Translation Summit VIIIProbst K and Levin L 2002 Challenges in automated elicitation of a controlled bilingualcorpus In Theoretical and Methodological Issues in Ma-chine Translation 2002 (TMI-02)Probst K Levin L Peterson E Lavie A and Carbonell J 2003 Mt for resource-poorlanguages using elicitation-based learning of syntactic transfer rules Machine Translation toappearSato S and Nagao M 1990 Towards memory-based translation In COLING-90 247252ACM Transactions on Computational Logic Vol V No N January 2004Experiments with a Hindi-to-English Transfer-based MT System _ 21Sherematyeva S and Nirenburg S 2000 Towards a un-versal tool for NLP resource acquisitionIn Proceedings of the 2nd International Conference on Language Resources andEvaluation (LREC-00)Vogel S and Tribble A 2002 Improving statistical ma-chine translation for a speech-tospeechtranslation task In Proceedings of the 7th International Conferenc on Spoken LanguageProcessing (ICSLP-02)Vogel S Zhang Y Tribble A Huang F Venugopal A Zhao B and Waibel A2003 The cmu statistical translation system In MT Sum-mit IX

Yamada K and Knight K 2001 A syntax-based statisti-cal translation model In Proceedingsof the 39th Anniversary Meeting of the Association for Computational Linguistics (ACL-01)ACM Transactions on Computational Logic Vol V No N January 2004

3 Conclusions and Future WorkThe ldquofirst-things-firstrdquo approach the AVENUE project

has taken to building NLP systems for scarce-resource languages has proven effective By first focusing effort on producing basic NLP resources the resultant resources are of sufficient quality to be put to any number of uses from building all manner of NLP tools to potentially aiding lin-guists in better understanding an indigenous culture and language For both Mapudungun and Quechua separate work on morphology analysis and on a transfer grammar modularized the problem in a way that allowed rapid de-velopment Besides a spelling-checker for Mapudungun the AVENUE team has developed computational lexicons morphology analyzers and one or more Machine Transla-tion systems for Mapudungun and Quechua into Spanish

The AVENUE team has recently put many of the re-sources we have developed for Mapudungun online at httpwwwlenguasamerindiasorgmapudungun The AVENUE interactive website which is still in an experi-mental phase contains a basic Mapudungun-Spanish lexi-con the Mapudungun morphological analyzer and the ex-ample-based MT system from Mapudungun to Spanish

The AVENUE team continues to develop the resources for both Mapudungun and Quechua We are actively working on improving the Mapudungun-Spanish rule-based MT system by both increasing the size of the lexi-con as well as improving the rules themselves The Ma-pudungun-Spanish example-based MT system can be im-proved by cleaning and increasing the size of the training text For the next version of the MT website we plan to plug in the Translation Correction Tool to allow bilingual users interested in translating sentences to give us feed-back about the correctness of the automatic translation produced by our systems in a simple and user-friendly way

31 Conclusions and Future WorkOur current Hebrew-to-English system was developed

over the course of a two-month period with a total labor-effort equivalent to about four person-months of develop-ment Unique problems of the Hebrew language particu-larly the inherent high-levels of ambiguity in morphology and orthography were addressed The bilingual dictionary and the morphological analyzer that were available to us had serious limitations but were adapted in novel ways to support the MT task The underlying transfer-based framework which we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language This provides some encouraging validation to the general suitability of this framework to rapid MT pro-totyping for languages with limited linguistic resources

The results of our rapid-development effort exceeded our expectations Evaluation results on an unseen test-set and an examination of actual translations of development data indicate that the current system is already effective

enough to produce comprehensible translations in many cases This was accomplished with only a small manually-developed transfer grammar that covers the structural transfer of common noun-phrases and a small number of other common structures We believe that it is the combi-nation of this very simple transfer grammar with the selec-tional disambiguation power provided by the English tar-get language model that together result in surprisingly ef-fective translation capabilities

We plan to continue the development of the Hebrew-to-English system over the next year Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological an-alyzer Several advanced issues that were not addressed in the current system will be investigated We plan to signifi-cantly enhance the sources of information used by the de-coder in disambiguating and selecting among translation segments These include (1) a language model for the source language trained from a monolingual Hebrew cor-pus which can help disambiguate the morphological read-ings of input words (2) scores for the transfer rules based on a scoring model currently under development and (3) a probabilistic version of the translation lexicon which we hope to train once we collect some amount of parallel He-brew-English data We expect dramatic further improve-ments in translation quality once these issues are properly addressed

In the conclusion for the paper contrasting Hindi He-brew Mapu and Quechua we should list the other lan-guages we have worked on and will be working on

Worked on DutchBeginning work on Inupiaq Brazillian Portugese lan-

guages native to Brail Aymara

4 AcknowledgementsMany people beyond the authors of this paper contrib-

uted to the work described The Chilean AVENUE team in-cludes Carolina Huenchullan Arruacutee Claudio Millacura Salas Eliseo Cantildeulef Rosendo Huisca Hugo Carrasco Hector Painequeo Flor Caniupil Luis Caniupil Huaiqui-ntildeir Marcela Collio Calfunao Cristian Carrillan Anton and Salvador Cantildeulef Rodolfo Vega was the liaison be-tween the CMU team and the government agency and in-digenous community in Chile People that participated in the development of Quechua data are Marilyn Feke Sa-lomeacute Gutierrez Irene Goacutemez and Yenny Ccolque

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631 We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer

5 ReferencesAllen James (1995) Natural Language Understanding

Second Edition ed Benjamin Cummings

Carnegie Mellon University 271206
Acknowledgements from the Hebrew paper
Carnegie Mellon University 271206
Conclusions from the Hebrew paper that can either be deleted or incorporated into the overall conclusions for this paper

Brown Ralf D Rebecca Hutchinson Paul NBennett Jaime G Carbonell and Peter Jansen (2003) Reduc-ing Boundary Friction Using Translation-Fragment Overlap in Proceedings of the Ninth Machine Transla-tion Summit New Orleans USA pp 24-31

Brown Ralf D (2000) Example-Based Machine Trans-lation at Carnegie Mellon University In The ELRA Newsletter European Language Resources Association vol 51 January-March 2000

Cusihuaman Antonio (2001) Gramatica Quechua Cuzco Callao 2a edicioacuten Centro Bartolomeacute de las Casas

Font Llitjoacutes Ariadna Jaime Carbonell Alon Lavie (2005a) A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation Eu-ropean Association of Machine Translation (EAMT) 10th Annual Conference Budapest Hungary

Font Llitjoacutes Ariadna Roberto Aranovich and Lori Levin (2005b) Building Machine translation systems for in-digenous languages Second Conference on the Indige-nous Languages of Latin America (CILLA II) Texas USA

Font Llitjoacutes Ariadna and Jaime Carbonell (2004) The Translation Correction Tool English-Spanish user stud-ies International Conference on Language Resources and Evaluation (LREC) Lisbon Portugal

Frederking Robert and Sergei Nirenburg (1994) Three Heads are Better than One Proceedings of the fourth Conference on Applied Natural Language Processing (ANLP-94) pp 95-100 Stuttgart Germany

Lavie A S Wintner Y Eytani E Peterson and K Probst (2004) Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004) Baltimore MD October 2004 Pages 1-10

Lavie Alon Stephan Vogel Lori Levin Erik Peterson Katharina Probst Ariadna Font Llitjoacutes Rachel Reynolds Jaime Carbonell and Richard Cohen (2003) Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario ACM Transactions on Asian Language Information Process-ing (TALIP) 2(2)

Levin Lori Alison Alvarez Jeff Good and Robert Fred-erking (In Press) Automatic Learning of Grammatical Encoding To appear in Jane Grimshaw Joan Maling Chris Manning Joan Simpson and Annie Zaenen (eds) Architectures Rules and Preferences A Festschrift for Joan Bresnan CSLI Publications

Levin Lori Rodolfo Vega Jaime Carbonell Ralf Brown Alon Lavie Eliseo Cantildeulef and Carolina Huenchullan (2000) Data Collection and Language Technologies for Mapudungun International Conference on Language Resources and Evaluation (LREC)

Mitchell Marcus A Taylor R MacIntyre A Bies C Cooper M Ferguson and A Littmann (1992) The Penn Treebank Project httpwwwcisupennedu tree-bankhomehtml

Monson Christian Lori Levin Rodolfo Vega Ralf Brown Ariadna Font Llitjoacutes Alon Lavie Jaime Car-bonell Eliseo Cantildeulef and Rosendo Huesca (2004) Data Collection and Analysis of Mapudungun Morphol-ogy for Spelling Correction International Conference on Language Resources and Evaluation (LREC)

Papineni Kishore Salim Roukos Todd Ward and Wei-Jing Zhu (2001) BLEU A Method for Automatic Evaluation of Machine Translation IBM Research Re-port RC22176 (W0109-022)

Peterson Erik (2002) Adapting a transfer engine for rapid machine translation development MS thesis Georgetown University

Probst Katharina (2005) Automatically Induced Syntac-tic Transfer Rules for Machine Translation under a Very Limited Data Scenario PhD Thesis Carnegie Mellon

Probst Katharina and Alon Lavie (2004) A structurally diverse minimal corpus for eliciting structural mappings between languages Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04)

Probst Katharina Ralf Brown Jaime Carbonell Alon Lavie Lori Levin and Erik Peterson (2001) Design and Implementation of Controlled Elicitation for Ma-chine Translation of Low-density Languages Proceed-ings of the MT2010 workshop at MT Summit

References

Carbonell Jaime Katharina Probst Erik Peterson Christian Monson Alon Lavie Ralf Brown and

Lori Levin 2002 Automatic Rule Learning for Re-source-Limited MT In Proceedings of the

5th Biennial Conference of the Association for Ma-chine Translation in the Americas (AMTA-02)

OctoberDahan Hiya 1997 HebrewndashEnglish EnglishndashHebrew

Dictionary Academon JerusalemDoddington George 2002 Automatic Evaluation of

Machine Translation Quality Using N-gram Co-Occurrence Statistics In Proceedings of the Second

Conference on Human Language Technology(HLT-2002)Efron Bradley and Robert Tibshirani 1986 Bootstrap

Methods for Standard Errors Condence Intervalsand Other Measures of Statistical Accuracy Statistical

Science 154ndash77Lavie Alon Stephan Vogel Lori Levin Erik Peter-

son Katharina Probst Ariadna Font Llitjos RachelReynolds Jaime Carbonell and Richard Cohen 2003

Experiments with a Hindi-to-EnglishTransfer-based MT System under a Miserly Data Sce-

nario Transactions on Asian Language InformationProcessing (TALIP) 2(2) JunePapineni Kishore Salim Roukos ToddWard

andWei-Jing Zhu 2002 BLEU a Method for AutomaticEvaluation of Machine Translation In Proceedings of

40th Annual Meeting of the Association forComputational Linguistics (ACL) pages 311ndash318

Philadelphia PA JulyPeterson Erik 2002 Adapting a Transfer Engine for

Rapid Machine Translation Development Mastersthesis Georgetown UniversityProbst Katharina Lori Levin Erik Peterson Alon

Lavie and Jaime Carbonell 2002 MT for Resource-Poor Languages Using Elicitation-Based Learning of

Syntactic Transfer Rules Machine Translation17(4)

Segal Erel 1999 Hebrew Morphological Analyzer for Hebrew Undotted Texts Masters thesis

Technion Israel Institute of Technology Haifa Octo-ber In Hebrew

Wintner Shuly 2004 Hebrew Computational Lin-guistics Past and Future Articial Intelligence

Review 21(2)113ndash138Wintner Shuly and Shlomo Yona 2003 Resources

for Processing Hebrew In Proceedings of theMT-Summit IX workshop on Machine Translation for

Semitic Languages New Orleans September

Carnegie Mellon University 271206
Refereces from the Hebrew paper
  • Omnivorous MT Building NLP Systems for Multiple Languages
  • Ariadna Font Llitjoacutes1 Christian Monson1 Roberto Aranovich2 Lori Levin1 Ralf Brown1 Katharina Probst3 Erik Peterson1 Jaime Carbonell1 Alon Lavie1
  • 1 Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • 2 Department of Linguistics
  • University of Pittsburgh
  • 3 Wherever Katharin is
Page 29: [Título del trabajo --14] - Carnegie Mellon School of · Web viewAt the word level, lexical transfer rules are accessed in order to seed the individual lexical choices for the TL

Table II compares the results of the di_erent systems (with the decoders best reordering window) along with the 95 con_dence inter-vals Figures 9 and 10 show the e_ect of di_er-ent systems with di_erent reordering windows in the decoder For clarity con_dence inter-vals are graphically shown only for the NIST scores (not for Bleu) and only for SMT Xfer with no grammar and MEMT

54 Discussion of Results

The results of the experiment clearly show that under the speci_c miserly data training scenario that we constructed the Xfer sys-tem with all its variants signi_cantly outper-formed the SMT system While the scenario of this experiment was clearly and intentionally more favorable towards our Xfer approach we see these results as a clear validation of the utility and e_ectiveness of our transfer ap-proach in other scenarios where only very lim-ited amounts of parallel text and other online resources are available In earlier experiments during the SLE we observed that SMT outper-formed the Xfer system when much larger amounts of parallel text data were used for system training This indicates that there ex-ists a data ldquocross-over point between the performance of the two systems given more and more data SMT will outperform the cur-rent Xfer system Part of future work will be to _rst determine this cross-over point and then to attempt to push this point further toward scenarios where more data is given thus making the Xfer system applicable to a wider variety of conditionsThe use of morphology within the Xfer system was also a signi_cant factor in the gap in per-formance between the Xfer system and the SMT system in theexperiment Token word coverage of the test-set without morphology is about 70 whereas with morphology token coverage in-creases to around 79 We acknowledge that the availability of a high-coverage morphologi-cal analyzer for Hindi worked to our favor and a morphological analyzer of such quality may not be available for many other languages Our Xfer approach however can function even with partial morphological information with some consequences on the e_ectiveness and generality of the rules learnedThe results of the comparison between the various versions of the Xfer system also show interesting trends although the statistical signi_cance of some of the di_erences is not very high Xfer with the manually developed transfer rule grammar clearly outperformed (with high statistical signi_cance) Xfer with no grammar and Xfer with automatically learned

grammar Xfer with automatically learned grammar is slightly better than Xfer with no grammar but the di_erence is statistically not very signi_cant We take these results to be highly encouraging since both the manually written and automatically learned grammars were very limited in this experiment The au-tomatically learned rules only covered NPs and PPs whereas the manually developed grammar mostly covers verb constructions While our main objective is to infer rules that perform comparably tohand-written rules it is encouraging that the hand-written grammar rules result in a big performance boost over the no-grammar sys-tem indicating that there is much room for improvement If the learning algorithms are improved the performance of the overall sys-tem can also be improved signi_cantlyThe signi_cant e_ects of decoder reordering are also quite interesting On one hand we believe this indicates that various more so-phisticated rules could be learned and that such rules could better order the English out-put thus reducing the need for re-ordering by the decoder On the other hand the results in-dicate that some of the ldquoburden of reordering can remain within the decoder thus possibly compensating for weaknesses in rule learningFinally we were pleased to see that the con-sistently best performing system was our multi-engine con_guration where we com-bined the translation hypotheses of the SMT and Xfer systems together into a common lat-tice and applied the decoder to select a _nal translation The MEMT con_guration outper-formed the best purely Xfer system with rea-sonable statistical con_dence Obtaining a multiengine combination scheme that consis-tently outperforms all the individual MT en-gines has been notoriously di_cult in past re-search While the results we obtained here are for a unique data scenario we hope that the framework applied here for multi-engine inte-gration will prove to be e_ective for a variety of other scenarios as well The inherent di_er-ences between the Xfer and SMT approaches should hopefully make them complementary in a broad range of data scenarios

6 CONCLUSIONS AND FUTURE WORK

The DARPA-sponsored SLE allowed us to de-velop and test an open-domain largescale Hindi-to-English version of our Xfer system This experience was extremely helpful for en-hancing the basic capabilities of our system The lattice-based decoding was added to our system at the very end of the month-long SLE and proved to be very e_ective in boosting the overall performance of our Xfer system

The experiments we conducted under the ex-tremely limited Hindi data resources scenario were very insightful The results of our experi-ments indicate that our Xfer system in its cur-rent state outperforms SMT and EBMT when the amount of available parallel training text is extremely small The Xfer system with man-ually developed transfer-rules outperformed the version of the system with automatically learned rules This is partly due to the fact that we only attempted to learn rules for NPs and VPs in this experiment We see the cur-rent results as an indication that there is signi_cant room for improving automatic rule learning In particular the learning of uni_cation constraints in our current system requires signi_cant further researchIn summary we feel that we have made signi_cant steps towards the development of a statistically grounded transfer-based MT sys-tem with (1) rules that are scored based on a well-founded probability model and (2) strong and e_ective decoding that incorporates the most advanced techniques used in SMT de-coding Our work complements recent work by other groups on improving translation perfor-mance by incorporating models of syntax into traditional corpus-driven MT methods The fo-cus of our approach however is from the op-posite end of the spectrum we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods We _nd our approach particularly suitable for anguages with very limited data resources

Acknowledgements

This research was funded in part by the DARPA TIDES program and by NSFgrant number IIS-0121-631 We would like to thank our team of Hindi-Englishbilingual speakers in Pittsburgh and in India that conducted the data collection forthe research work reported in this paper

REFERENCES

The Brown Corpus. http://www.hit.uib.no/icame/brown/bcm.html
The Johns Hopkins University Hindi translation webpage. http://nlp.cs.jhu.edu/hindi
Morphology module from IIIT. http://www.iiit.net/ltrc/morph/index.htm
The Penn Treebank. http://www.cis.upenn.edu/~treebank/home.html
Allen, J. 1995. Natural Language Understanding, Second Edition. Benjamin Cummings.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. 1990. A statistical approach to Machine Translation. Computational Linguistics 16, 2, 79–85.
Brown, P., Della Pietra, V., Della Pietra, S., and Mercer, R. 1993. The mathematics of statistical Machine Translation: Parameter estimation. Computational Linguistics 19, 2, 263–311.
Brown, R. 1997. Automated dictionary extraction for knowledge-free example-based translation. In International Conference on Theoretical and Methodological Issues in Machine Translation, 111–118.
Doddington, G. 2003. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of Human Language Technology.
Jones, D. and Havrilla, R. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98).
Leech, G. 1992. 100 million words of English: The British National Corpus. Language Research 28, 1, 1–13.
Nirenburg, S. 1998. Project Boas: A linguist in the box as a multi-purpose language resource. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC-98).
Och, F. J. and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
Papineni, K., Roukos, S., and Ward, T. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-98), 189–192.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 311–318.
Peterson, E. 2002. Adapting a transfer engine for rapid machine translation development. M.S. thesis, Georgetown University.
Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., and Peterson, E. 2001. Design and implementation of controlled elicitation for machine translation of low-density languages. In Workshop MT2010 at Machine Translation Summit VIII.
Probst, K. and Levin, L. 2002. Challenges in automated elicitation of a controlled bilingual corpus. In Theoretical and Methodological Issues in Machine Translation 2002 (TMI-02).
Probst, K., Levin, L., Peterson, E., Lavie, A., and Carbonell, J. 2003. MT for resource-poor languages using elicitation-based learning of syntactic transfer rules. Machine Translation, to appear.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In COLING-90, 247–252.
Sherematyeva, S. and Nirenburg, S. 2000. Towards a universal tool for NLP resource acquisition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-00).
Vogel, S. and Tribble, A. 2002. Improving statistical machine translation for a speech-to-speech translation task. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-02).
Vogel, S., Zhang, Y., Tribble, A., Huang, F., Venugopal, A., Zhao, B., and Waibel, A. 2003. The CMU statistical translation system. In MT Summit IX.
Yamada, K. and Knight, K. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL-01).

3. Conclusions and Future Work

The "first-things-first" approach the AVENUE project has taken to building NLP systems for resource-scarce languages has proven effective. By first focusing effort on producing basic NLP resources, the resulting resources are of sufficient quality to be put to any number of uses, from building all manner of NLP tools to potentially aiding linguists in better understanding an indigenous culture and language. For both Mapudungun and Quechua, separate work on morphological analysis and on a transfer grammar modularized the problem in a way that allowed rapid development. Besides a spelling-checker for Mapudungun, the AVENUE team has developed computational lexicons, morphological analyzers, and one or more machine translation systems for Mapudungun and Quechua into Spanish.

The AVENUE team has recently put many of the resources we have developed for Mapudungun online at http://www.lenguasamerindias.org/mapudungun. The AVENUE interactive website, which is still in an experimental phase, contains a basic Mapudungun-Spanish lexicon, the Mapudungun morphological analyzer, and the example-based MT system from Mapudungun to Spanish.

The AVENUE team continues to develop the resources for both Mapudungun and Quechua. We are actively working on improving the Mapudungun-Spanish rule-based MT system, both by increasing the size of the lexicon and by improving the rules themselves. The Mapudungun-Spanish example-based MT system can be improved by cleaning and increasing the size of the training text. For the next version of the MT website, we plan to plug in the Translation Correction Tool, which will allow bilingual users interested in translating sentences to give us feedback about the correctness of the automatic translations produced by our systems in a simple and user-friendly way.

3.1 Conclusions and Future Work

Our current Hebrew-to-English system was developed over the course of a two-month period, with a total labor effort equivalent to about four person-months of development. Unique problems of the Hebrew language, particularly the inherently high levels of ambiguity in morphology and orthography, were addressed. The bilingual dictionary and the morphological analyzer that were available to us had serious limitations, but were adapted in novel ways to support the MT task. The underlying transfer-based framework that we applied proved to be sufficient and appropriate for fast adaptation to Hebrew as a new source language. This provides some encouraging validation of the general suitability of this framework for rapid MT prototyping for languages with limited linguistic resources.

The results of our rapid-development effort exceeded our expectations. Evaluation results on an unseen test set and an examination of actual translations of development data indicate that the current system is already effective enough to produce comprehensible translations in many cases. This was accomplished with only a small manually developed transfer grammar that covers the structural transfer of common noun phrases and a small number of other common structures. We believe that it is the combination of this very simple transfer grammar with the selectional disambiguation power provided by the English target-language model that results in surprisingly effective translation capabilities.
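To make the role of the target language model concrete, the following is a minimal illustrative sketch, not the actual AVENUE decoder: alternative transfer outputs for one segment are ranked by an English bigram language model, and the most fluent candidate wins. The toy training corpus and the candidate strings are invented for illustration.

```python
import math
from collections import Counter

# Toy English text standing in for the target LM's training data;
# a real system would train on a large monolingual corpus.
corpus = "the house is big . the big house is old .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(unigrams)

def lm_logprob(sentence):
    """Add-one-smoothed bigram log-probability of a tokenized string."""
    tokens = sentence.split()
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )

# Hypothetical alternative outputs of a simple transfer grammar for one
# source noun phrase; the LM resolves the word-order ambiguity.
candidates = ["the big house", "house the big", "big the house"]
print(max(candidates, key=lm_logprob))  # -> the big house
```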

We plan to continue the development of the Hebrew-to-English system over the next year. Significant further work will be required to improve the coverage and quality of the word translation lexicon and the morphological analyzer. Several advanced issues that were not addressed in the current system will be investigated. We plan to significantly enhance the sources of information used by the decoder in disambiguating and selecting among translation segments. These include: (1) a language model for the source language, trained from a monolingual Hebrew corpus, which can help disambiguate the morphological readings of input words; (2) scores for the transfer rules, based on a scoring model currently under development; and (3) a probabilistic version of the translation lexicon, which we hope to train once we collect some amount of parallel Hebrew-English data. We expect dramatic further improvements in translation quality once these issues are properly addressed.
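As a sketch of item (1) above, source-side disambiguation of morphological readings could look like the following. Everything here is an invented placeholder (the readings, tags, and probabilities are not output of the real Hebrew analyzer or of any trained model): each input word contributes a set of alternative readings, and the reading sequence scoring best under a bigram model over tags is kept.

```python
import math

# Hypothetical lattice: for each input token, alternative (lemma, tag)
# readings produced by a morphological analyzer.
lattice = [
    [("b", "PREP"), ("bait", "NOUN")],   # one ambiguous surface form
    [("gdol", "ADJ"), ("gdl", "VERB")],  # another
]

# Hypothetical smoothed bigram model over tags (log-probabilities).
tag_lm = {
    ("<s>", "PREP"): math.log(0.3), ("<s>", "NOUN"): math.log(0.4),
    ("NOUN", "ADJ"): math.log(0.5), ("NOUN", "VERB"): math.log(0.2),
    ("PREP", "ADJ"): math.log(0.1), ("PREP", "VERB"): math.log(0.1),
}
FLOOR = math.log(1e-6)  # back-off score for unseen tag pairs

def best_readings(lattice):
    """Exhaustively score every path through the small lattice and return
    the reading sequence with the highest tag-LM score (a real system
    would use Viterbi search rather than enumeration)."""
    paths = [((), 0.0, "<s>")]  # (readings so far, score, previous tag)
    for readings in lattice:
        paths = [
            (chosen + (r,), score + tag_lm.get((prev, r[1]), FLOOR), r[1])
            for chosen, score, prev in paths
            for r in readings
        ]
    return max(paths, key=lambda p: p[1])[0]

print(best_readings(lattice))  # -> (('bait', 'NOUN'), ('gdol', 'ADJ'))
```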

In the conclusion for the paper contrasting Hindi, Hebrew, Mapudungun, and Quechua, we should list the other languages we have worked on and will be working on.

Worked on: Dutch.
Beginning work on: Inupiaq, Brazilian Portuguese, languages native to Brazil, and Aymara.

4. Acknowledgements

Many people beyond the authors of this paper contributed to the work described. The Chilean AVENUE team includes Carolina Huenchullan Arrúe, Claudio Millacura Salas, Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, and Salvador Cañulef. Rodolfo Vega was the liaison between the CMU team and the government agency and indigenous community in Chile. People who participated in the development of Quechua data are Marilyn Feke, Salomé Gutierrez, Irene Gómez, and Yenny Ccolque.

This research was funded in part by the DARPA TIDES program and by NSF grant number IIS-0121-631.

Acknowledgments

The research reported in this paper was made possible by support from the Caesarea Rothschild Institute at Haifa University and was funded in part by NSF grant number IIS-0121631. We wish to thank the Hebrew Knowledge Center at the Technion for providing the morphological analyzer.

5. References

Allen, James (1995). Natural Language Understanding, Second Edition. Benjamin Cummings.

Acknowledgements from the Hebrew paper
Conclusions from the Hebrew paper that can either be deleted or incorporated into the overall conclusions for this paper

Brown, Ralf D., Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, and Peter Jansen (2003). Reducing Boundary Friction Using Translation-Fragment Overlap. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp. 24-31.

Brown, Ralf D. (2000). Example-Based Machine Translation at Carnegie Mellon University. In The ELRA Newsletter, European Language Resources Association, vol. 5:1, January-March 2000.

Cusihuaman, Antonio (2001). Gramática Quechua, Cuzco Callao. 2a edición. Centro Bartolomé de las Casas.

Font Llitjós, Ariadna, Jaime Carbonell, and Alon Lavie (2005a). A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation. European Association for Machine Translation (EAMT) 10th Annual Conference, Budapest, Hungary.

Font Llitjós, Ariadna, Roberto Aranovich, and Lori Levin (2005b). Building Machine Translation Systems for Indigenous Languages. Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Font Llitjós, Ariadna and Jaime Carbonell (2004). The Translation Correction Tool: English-Spanish User Studies. International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Frederking, Robert and Sergei Nirenburg (1994). Three Heads are Better than One. Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), pp. 95-100, Stuttgart, Germany.

Lavie, A., S. Wintner, Y. Eytani, E. Peterson, and K. Probst (2004). Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004, pages 1-10.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen (2003). Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Levin, Lori, Alison Alvarez, Jeff Good, and Robert Frederking (In Press). Automatic Learning of Grammatical Encoding. To appear in Jane Grimshaw, Joan Maling, Chris Manning, Joan Simpson, and Annie Zaenen (eds.), Architectures, Rules and Preferences: A Festschrift for Joan Bresnan. CSLI Publications.

Levin, Lori, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan (2000). Data Collection and Language Technologies for Mapudungun. International Conference on Language Resources and Evaluation (LREC).

Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Ferguson, and A. Littmann (1992). The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Monson, Christian, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huesca (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. International Conference on Language Resources and Evaluation (LREC).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

Peterson, Erik (2002). Adapting a Transfer Engine for Rapid Machine Translation Development. M.S. thesis, Georgetown University.

Probst, Katharina (2005). Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario. Ph.D. thesis, Carnegie Mellon University.

Probst, Katharina and Alon Lavie (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04).

Probst, Katharina, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson (2001). Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. Proceedings of the MT2010 workshop at MT Summit.

References

Carbonell, Jaime, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. 2002. Automatic Rule Learning for Resource-Limited MT. In Proceedings of the 5th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-02), October.

Dahan, Hiya. 1997. Hebrew–English English–Hebrew Dictionary. Academon, Jerusalem.

Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference on Human Language Technology (HLT-2002).

Efron, Bradley and Robert Tibshirani. 1986. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1:54–77.

Lavie, Alon, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. 2003. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. Transactions on Asian Language Information Processing (TALIP), 2(2), June.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, Philadelphia, PA, July.

Peterson, Erik. 2002. Adapting a Transfer Engine for Rapid Machine Translation Development. Master's thesis, Georgetown University.

Probst, Katharina, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell. 2002. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. Machine Translation, 17(4).

Segal, Erel. 1999. Hebrew Morphological Analyzer for Hebrew Undotted Texts. Master's thesis, Technion, Israel Institute of Technology, Haifa, October. In Hebrew.

Wintner, Shuly. 2004. Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, 21(2):113–138.

Wintner, Shuly and Shlomo Yona. 2003. Resources for Processing Hebrew. In Proceedings of the MT-Summit IX workshop on Machine Translation for Semitic Languages, New Orleans, September.

References from the Hebrew paper

Conclusions from the Hindi paper:

The experiments we conducted under the extremely limited Hindi data resources scenario were very insightful. The results of our experiments indicate that our Xfer system, in its current state, outperforms SMT and EBMT when the amount of available parallel training text is extremely small. The Xfer system with manually developed transfer rules outperformed the version of the system with automatically learned rules. This is partly due to the fact that we only attempted to learn rules for NPs and VPs in this experiment. We see the current results as an indication that there is significant room for improving automatic rule learning. In particular, the learning of unification constraints in our current system requires significant further research.

In summary, we feel that we have made significant steps towards the development of a statistically grounded transfer-based MT system, with (1) rules that are scored based on a well-founded probability model and (2) strong and effective decoding that incorporates the most advanced techniques used in SMT decoding. Our work complements recent work by other groups on improving translation performance by incorporating models of syntax into traditional corpus-driven MT methods. The focus of our approach, however, is from the opposite end of the spectrum: we enhance the performance of a syntactically motivated rule-based approach to MT using strong statistical methods. We find our approach particularly suitable for languages with very limited data resources.
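One standard way to realize the combination of (1) and (2) above, shown here only as a hedged sketch with invented weights and scores rather than as the paper's exact model, is a weighted log-linear sum of the rule scores used in a candidate's derivation and the target LM score, with the decoder keeping the highest-scoring candidate.

```python
import math

def decoder_score(rule_logprobs, lm_logprob, w_rules=1.0, w_lm=1.0):
    """Log-linear combination of transfer-rule scores and the target
    language-model score; the weights are hypothetical tuning knobs."""
    return w_rules * sum(rule_logprobs) + w_lm * lm_logprob

# Two hypothetical candidate translations: the rule log-probabilities
# used in each derivation, and each string's target-LM log-probability.
candidates = {
    "the old man saw the boy": ([math.log(0.5), math.log(0.4)], math.log(1e-2)),
    "old the man saw boy the": ([math.log(0.6), math.log(0.5)], math.log(1e-5)),
}
best = max(candidates, key=lambda c: decoder_score(*candidates[c]))
print(best)  # -> the old man saw the boy
```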
