Page 1

A roadmap for MT: four « keys » to handle more languages, for all kinds of tasks, while making it possible to improve quality (on demand)

International Conference on Universal Knowledge and Language (ICUKL2002), Goa, 25-29 November 2002

Christian Boitet
GETA, CLIPS, IMAG, 385 av. de la bibliothèque, BP 53, F-38041 Grenoble cedex 9, France
[email protected], http://clips.imag.fr/geta

Page 2

Outline

• Basic concepts
  What is MT?
  Goals: Quality / User
  Architectures: Vauquois' triangle

• State of the art
  MT of texts: examples, problems
  MT of spoken dialogs

• The future of MT
  Goals
  4 keys

Page 3

What is M(a)T?

• At least 3 types of automation
  MT = Machine Translation
  MAT = Machine Assisted Translation
  MAHT = Machine Aided Human Translation

• A scientific technology
  Informatics (computer science)
  Linguistics
  Mathematics

Page 4

Goals: Quality / User

[Figure: chart with two axes. User: linguistically naive / linguistically specialized. Quality: rough, quick / from raw to very good. Labels recovered from the chart:]

• MT for access: special fields (atom, chemistry…), general information
• MT for translators: helps (lexicons, proposals from a translation memory…)
• MT for individual authors, with interactive disambiguation
• MT for revisors (posteditors): raw MT, polishable

Page 5

Architectures: Vauquois' triangle

[Figure: Vauquois' triangle. Labels recovered from the diagram:]

• Levels: graphemic, morpho-syntactic, syntagmatic, syntactico-functional, logico-semantic, interlingual, deep understanding
• Structures: text, tagged text, C-structures (constituent), F-structures (functional), SPA-structures (semantic & predicate-argument), semantico-linguistic interlingua, ontological interlingua
• Translation paths: direct translation, semi-direct translation, syntactic transfer (surface), syntactic transfer (deep), semantic transfer, multilevel transfer, conceptual transfer
• Other labels: ascending transfer, descending transfers; mixing levels: multilevel description
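The idea behind the triangle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bottom-to-top ordering of the levels is a reading of the figure labels, `Level` and `translation_path` are hypothetical names, and the sketch covers only symmetric paths (it ignores the ascending and descending transfers that join different levels).

```python
from enum import IntEnum


class Level(IntEnum):
    # Assumed bottom-to-top ordering of the levels named on the slide.
    GRAPHEMIC = 0
    MORPHO_SYNTACTIC = 1
    SYNTAGMATIC = 2
    SYNTACTICO_FUNCTIONAL = 3
    LOGICO_SEMANTIC = 4
    INTERLINGUAL = 5
    DEEP_UNDERSTANDING = 6


def translation_path(transfer_level: Level) -> list[str]:
    """List the steps of a symmetric path: analyze the source text up to
    transfer_level, cross to the target language at that level, then
    generate back down to target text."""
    up = [f"analyze to {Level(i).name}" for i in range(1, transfer_level + 1)]
    across = [f"transfer at {transfer_level.name}"]
    down = [f"generate from {Level(i).name}" for i in range(transfer_level, 0, -1)]
    return up + across + down


for step in translation_path(Level.SYNTAGMATIC):
    print(step)
```

With `Level.GRAPHEMIC` the path reduces to a single transfer step, which corresponds to direct translation; the higher the transfer level, the more analysis and generation work there is on each side.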

Page 6

Architectures: Vauquois' triangle (larger)

[Figure: enlarged version of Vauquois' triangle, with the same labels as on page 5.]

Page 7

Formal intermediate structures

• Linguistic level(s): surface, deep; 1-level, n-level
• Main linguistic organization: syntagms (constituents), dependencies, logical and semantic relations
• Geometrical structure: string, chain graph (chart), tree structure, graph / network, hypergraph
• Algebraic structure: labels, struct. string, Boolean features, structured attributes, feature structures
• Structure-text correspondence: concrete (text ≈ readable from structure), abstract (e.g. UNL)
• Scope: sentence (almost all), paragraph, page (Ariane-G5, Sygmart), document

Page 8

How to produce an MT system

• Choose an architecture

• Program the "tools"
  Specialized languages for linguistic programming (SSLP)
  Development environment (MT shell)

• Build the "lingware"
  Lexical data / rules / weights
  Grammatical data / rules / weights
  Possible specialization to a typology ("sublanguage")

• How?
  Human work ± computer help / support
  Automatic learning (weights, likelihoods…)

Page 9

State of affairs

• Only a small number of language pairs are covered by MT systems designed for information access
  Systran EC (2000): 19/110 language pairs, 8 OK for intended use
  See also examples by Ronaldo Martins

• Even fewer are capable of quality translation or speech translation

• Now a few examples…

Page 10

Examples: MT for access, Web (1)

ENGLISH (human version): The European-Heritage.net thesaurus covers the fields of archaeology and architecture as defined in the Council of Europe conventions signed in Granada (1985) and Malta (1992).

FRENCH (human version): Le thesaurus European-Heritage.net couvre les champs de l'archéologie et de l'architecture au sens des conventions du Conseil de l'Europe de Grenade (1985) et de Malte (1992).

ENGLISH (Systran FRE-ENG version): The European-Heritage.net thesaurus covers the fields of archaeology and architecture within the meaning of conventions of the Council of Europe of Grenade (1985) and Malta (1992).

ENGLISH (human version): It encompasses information ranging from the partners involved, categories of cultural assets and legislation, to activities, skills and funding. It is supplemented by a number of specific thesauruses compiled by each member state on a particular topic, such as the thesaurus on Andalusian heritage or the architectural thesaurus from the Mérimée database in France.

FRENCH (human version): Il prend en compte des aspects aussi variés que les acteurs, les catégories de biens culturels, la législation ou encore les interventions, les métiers et les financements. Il est complété et prolongé par des thesaurus spécifiques développés par chaque Etat membre sur tel ou tel sujet spécifique, comme le thesaurus du patrimoine historique andalou ou le thesaurus d'architecture de la base de données documentaire Mérimée en France.

ENGLISH (Systran FRE-ENG version): It takes into account aspects as varied as the actors, the categories of cultural goods, the legislation or the interventions, the trades and the financings. It is supplemented and prolonged by thesaurus specific developed by each Member State on such or such specific subject, like the thesaurus of the Andalusian historical inheritance or the thesaurus of architecture of the documentation database Mérimée in France.

ENGLISH (human version): This new, open-ended search tool will come on line shortly, together with a management and administration system shared among the various contributors.

FRENCH (human version): Cet instrument de recherche, forcément évolutif, sera mis prochainement en ligne accompagné d'un dispositif de gestion et d'administration réparti entre les différents contributeurs.

ENGLISH (Systran FRE-ENG version): This instrument of search, inevitably evolutionary, will be put soon on line accompanied by a device of management and administration distributed between the various contributors.

Page 11

Examples: MT for access, Web (2)

• FE quite "easy", compared with EG and mainly FG

GERMAN (Systran ENG-GER version): Der European-Heritage.net thesaurus umfaßt die Felder von archaeology und von Architektur, wie in den Europaratvereinbarungen definiert, die in Granada (1985) unterzeichnet werden und in Malta (1992).

GERMAN (Systran FRE-GER version): Der European-Heritage.net-Thesaurus bedeckt die Felder der Archäologie und der Architektur im Sinne der Übereinkommen des Europarats von Granada (1985) und von Malta (1992).

GERMAN (Systran ENG-GER version): Er gibt die Informationen um, die von den betroffenen Partnern, von den Kategorien der kulturellen Werte und der Gesetzgebung, bis zu Aktivitäten, von den Fähigkeiten und von der Finanzierung reichen. Er wird durch eine Anzahl von den spezifischen Thesauren ergänzt, die durch jeden Mitgliedsstaat auf einem bestimmten Thema, wie dem Thesaurus auf Andalusian Erbe oder dem architektonischen Thesaurus von der Datenbank Mérimée in Frankreich kompiliert werden.

GERMAN (Systran FRE-GER version): Er berücksichtigt Aspekte dermaßen variierte, daß die Beteiligten, die Kategorien kultureller Güter, die Gesetzgebung oder noch die Interventionen, die Berufe und die Finanzierungen. Er wird vervollständigt und wird durch ein spezifische Thesaurus entwickelt durch jeder Mitgliedstaat über das eines oder andere spezifische Thema verlängert, als der Thesaurus des andalusischen historischen Kulturgutes oder der Thesaurus der Architektur der urkundlichen Datenbank Mérimée in Frankreich.

GERMAN (Systran ENG-GER version): Dieses neue, offene Suchhilfsmittel kommt auf Zeile kurz, zusammen mit einem Management- und Leitungssystem, das unter den verschiedenen Mitwirkenden geteilt wird.

GERMAN (Systran FRE-GER version): Dieses notgedrungen entwicklungsfähige Forschungsinstrument wird gestellt demnächst online begleitet von einer Verwaltungs- und Verwaltungsvorrichtung, die aufgeteilt unter den verschiedenen Beitragenden.

Page 12

Comparison: raw vs rough MT

SpanAm raw Spanish-English output: Message of the Director-General of the World Health Organization

Reverso raw Spanish-English output: Message of the Chief operating officer of the World Organization of the Health

SpanAm raw Spanish-English output: From its discovery, antibiotics have completely transformed the perspective of humankind with respect to infectious diseases. Today the use of antibiotics, combined with improvements in sanitation, housing, and nutrition, together with the advent of the vaccination programs generalized, have caused a notable reduction of infectious diseases that previously were common and annihilated entire populations.

Reverso raw Spanish-English output: From his{*its*} discovery, the antibiotics have transformed completely the perspective of the humanity with regard to the infectious diseases. Today the use of the antibiotics, cocktail with improvements in the reparation, the housing and the nutrition, together with the advent of the programs of widespread vaccination, they have given place to a notable decrease of infectious diseases that before were common and were annihilating entire populations.

SpanAm raw Spanish-English output: Scourges that terrified millions of people, as plague, whooping cough, poliomyelitis, and the scarlatina, have been controlled or are on the verge of being controlled. Now, in the dawn of a new millennium, humankind faces another crisis. Previously curable diseases as the gonorrhea and typhoid fever are becoming rapidly difficult to treat, while old assassins as tuberculosis and malaria now are armed of the increasingly impenetrable resistance to the antimicrobial drugs.

Reverso raw Spanish-English output: Scourges that terrified million persons, as the pest, the savage cough, the poliomyelitis and the scarlatina, they have been controlled or are on the verge of be controlling. Now, in the dawn of a new millenium, the humanity faces with another crisis. Diseases before curable as the gonorrhea and the fever tifoidea they are becoming rapidly difficult to treat, whereas killer old men as the tuberculosis and the malaria are armed{*assembled*} now with the increasing impenetrable resistance the antimicrobial ones.

SpanAm raw Spanish-English output: This phenomenon is potentially contenible. The problem is increasingly profound and complex, accelerated by the abuse of antibiotics in the developed countries and the paradoxical underutilization of the quality antimicrobial drugs in the developing countries due to the poverty and to the scarcity resulting from an effective health care.

Reverso raw Spanish-English output: This phenomenon is potentially contenible. The problem is increasingly deep and complex, accelerated by the abuse of the antibiotics in the developed countries and the paradoxical subutilization of the antimicrobial ones of quality in the countries in development due to the poverty and the resultant shortage of an attention of effective health.

Page 13

Examples: MT for revisors…

Page 14

…with BV-aero/FE (2)

Page 15

MT of spoken dialogs

• Specialized systems are already usable
  e.g. ATR/Matsushita, IBM, CSTAR/Nespole!…
  Much "noise" and "ungrammaticalities"
  But specializing is very helpful!

• General systems are also possible
  e.g. NEC/Xroad, Linguatec/Talk&Translate
  Speech recognition is already good enough
  Rough may be good enough (e.g. for chatting)

• Interpretation is different from translation…
  …and participants are intelligent!
  Similarity with access-oriented MT

Page 16

French-Korean through IF (1)

Page 17

French-Korean through IF (2)

Page 18

French-Korean through IF (3)

Page 19

A road map… to which goals?

• MT of adequate quality

• Not only for access

• For all languages

Page 20

Four keys

• 2 on the technical side
  Compromise: a far wider coverage, a somewhat smaller asymptotic quality
  • Automatic learning techniques
  • Using non-textual pivots (intermediate formal descriptors)

• 2 on the organizational side
  Democratization, cooperation
  • Cooperative development of open source linguistic resources on the Web
  • Towards systems where quality can be improved "on demand" by users

Page 21

Learning techniques

• Extend the use of hybrid techniques
  symbolic, numerical, or mixed
  ==> they have demonstrated their potential at the research level
  • stochastic grammars
  • weighted (or "neural") dictionaries

• or build new tools, intrinsically numerical
  inspiration from voice recognition

• 2 examples
  learning analyzers: text -> semantic tree (IBM)
  learning implicit very detailed DG from tree bank (NAIST)
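As a rough illustration of what a "weighted dictionary" could look like, the Python sketch below estimates translation weights as relative frequencies over word-aligned pairs and picks the highest-weighted candidate. It is not the method of any system named in the slides: the function names and the toy aligned pairs are hypothetical, and real systems would learn from large corpora with far more context.

```python
from collections import Counter, defaultdict


def learn_weights(aligned_pairs):
    """Estimate a weight for each (source, target) pair as the relative
    frequency of the target among all observed targets of that source."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    weights = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        weights[source] = {t: c / total for t, c in targets.items()}
    return weights


def best_translation(weights, source):
    """Return the candidate with the highest learned weight."""
    candidates = weights[source]
    return max(candidates, key=candidates.get)


# Hypothetical toy alignments: French "carte" seen once as "card", three times as "map".
pairs = [("carte", "card"), ("carte", "map"), ("carte", "map"), ("carte", "map")]
weights = learn_weights(pairs)
print(weights["carte"])
print(best_translation(weights, "carte"))
```

The same dictionary entry can thus carry several competing translations, with the weights rather than hand-written rules deciding between them.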

Page 22

Using non-textual pivots

• Semantico-pragmatic (ontological) pivots
  task & domain oriented ==> limited applicability

• Abstract linguistic descriptors
  the most precise, but often too sophisticated
  depend on each language

• Anglo-semantic pivot: UNL
  "the HTML of linguistic content"
  • in UNL, a hypergraph represents the abstract structure of a (supposedly) equivalent English utterance
  less precise but "robust"
  symbols constructed from English ==> usable by all developers
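One common argument for pivot architectures is the number of components to build. The Python sketch below works through that arithmetic under two assumptions that are not stated on the slides: a transfer architecture needs one module per directed language pair, and a pivot architecture needs one analyzer into the pivot plus one generator out of it per language. The language counts 11, 20 and 180 are the ones quoted elsewhere in this talk.

```python
def transfer_modules(n_languages: int) -> int:
    """One transfer module per directed language pair: n * (n - 1)."""
    return n_languages * (n_languages - 1)


def pivot_modules(n_languages: int) -> int:
    """One analyzer into the pivot and one generator out of it per language."""
    return 2 * n_languages


for n in (11, 20, 180):
    print(f"{n} languages: {transfer_modules(n)} transfer modules, "
          f"{pivot_modules(n)} pivot modules")
```

For 11 languages this gives 110 directed pairs, which is consistent with the "19/110 language pairs" figure quoted for Systran EC if that figure counts directed pairs, against 22 pivot components. For 180 languages the gap widens to 32220 against 360.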

Page 23

A simple UNL graph

[Figure: UNL graph. Labels recovered from the diagram:]

• Nodes:
  score(icl>event,agt>human,fld>sport).@entry.@past.@complete
  Ronaldo(icl>proper noun)
  goal(icl>abstract thing)
  head(pof>body).@def
  corner(icl>thing).@def
  left(aoj<thing)
  goal(icl>concrete thing)

• Arc labels: agt, obj, ins, pos, plt, mod, pos

• Ronaldo has headed the ball into the left corner of the goal
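A graph like this one can be held in a very small data structure: a list of labeled arcs between node strings. The Python sketch below uses the node and arc labels recovered from the slide, but the assignment of each label to a particular pair of nodes is an assumption (the transcript does not preserve the drawing), and `Arc`, `outgoing` and `reachable` are hypothetical helper names. It is a plain directed graph, without the scopes that make UNL structures hypergraphs.

```python
from collections import namedtuple

Arc = namedtuple("Arc", "label source target")

ENTRY = "score(icl>event,agt>human,fld>sport).@entry.@past.@complete"
RONALDO = "Ronaldo(icl>proper noun)"
GOAL_SCORED = "goal(icl>abstract thing)"
HEAD = "head(pof>body).@def"
CORNER = "corner(icl>thing).@def"
LEFT = "left(aoj<thing)"
GOAL_FRAME = "goal(icl>concrete thing)"

# Assumed wiring of the seven arc labels shown on the slide.
ARCS = [
    Arc("agt", ENTRY, RONALDO),      # agent of the scoring event
    Arc("obj", ENTRY, GOAL_SCORED),  # what is scored
    Arc("ins", ENTRY, HEAD),         # instrument
    Arc("pos", HEAD, RONALDO),       # possessor of the head
    Arc("plt", ENTRY, CORNER),       # final place
    Arc("mod", CORNER, LEFT),        # modifier of the corner
    Arc("pos", CORNER, GOAL_FRAME),  # the corner belongs to the goal
]


def outgoing(node, arcs=ARCS):
    """Sorted labels of the arcs leaving a node."""
    return sorted(arc.label for arc in arcs if arc.source == node)


def reachable(start, arcs=ARCS):
    """All nodes that can be reached from start by following arcs."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for arc in arcs:
            if arc.source == node and arc.target not in seen:
                seen.add(arc.target)
                frontier.append(arc.target)
    return seen


print(outgoing(ENTRY))
print(len(reachable(ENTRY)))
```

Under this reading every node is reachable from the `@entry` node, which is what a generator walking the graph from its entry point would need.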

Page 24

Cooperative development

• of open source linguistic resources

• on the Web

Mutualization (pooling of resources) is necessary at least for lexical knowledge
  too costly even for the leaders
  size (#entries) has to increase for each language (300K, 3M?)
  #languages has to increase dramatically (11 -> 20 -> 180?)

Integration of human- and machine-oriented knowledge is useful
  e.g. to produce mixed MT/MAHT systems

Page 25

A contribution: the Papillon project

• Goal: produce many open source dictionaries from a central lexical database

• Means:
  build rich (DiCo) monolingual dictionaries of lexies (senses)
  interlink lexies by interlingual links (axies)
  use XML & associated tools as basis to generate many formats
  • for humans and for machines
  start from (free) digital resources
  induce "consumers" to become "producers" (contributors)

• Quality control:
  private accounts
  central validating/integrating group

Page 26

Papillon database macrostructure

[Figure: Papillon database macrostructure. Labels recovered from the diagram: Lexical Database; Users; Dictionaries; Resources; Human Contributors; Interaction with the Dictionaries; Extraction of Dictionaries; Integration of existing resources.]

Page 27

PAPILLON diagram

• Interlingual links based on translations = "AXIEs"
• Possibility to link 1 lexie with >1 acceptions
• References to other semantic systems: AXIE 1-to-n UW

[Figure: Papillon lexical network. Labels recovered from the diagram:]

French DiCo
  Vocable carte n.f.
  Lexie carte.1 carte à jouer
  Lexie carte.2 carte géographique

Japan. DiCo
  地図
  カード

Thai DiCo

Engl. DiCo
  Vocable card N
  Lexie card.1 playing card
  Lexie card.2 money card
  Vocable=lexie map

Interlingual links
  Acception 343, UNL: card(icl>play), card(icl>thing)…
  Acception 345, UNL: map(fld>geography)
  Acception 1002, UNL: card(fld>money)
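The structure in this diagram, monolingual lexies connected through interlingual acceptions (axies), can be sketched as a small Python data model. The lexie identifiers with language prefixes (`fra:`, `eng:`, `jpn:`), the `Axie` class and the `translations` function are hypothetical, and the assignment of each lexie to an acception is an assumption read off the labels, since the transcript does not preserve the drawn links.

```python
from dataclasses import dataclass, field


@dataclass
class Axie:
    """An interlingual acception linking lexies of several languages."""
    ident: int
    unl: list[str]                      # UNL references shown on the slide
    lexies: set[str] = field(default_factory=set)


# Assumed links; the slide elides further UNL references for acception 343.
AXIES = [
    Axie(343, ["card(icl>play)", "card(icl>thing)"],
         {"fra:carte.1", "eng:card.1", "jpn:カード"}),
    Axie(345, ["map(fld>geography)"],
         {"fra:carte.2", "eng:map", "jpn:地図"}),
    Axie(1002, ["card(fld>money)"], {"eng:card.2"}),
]


def translations(lexie: str, target_lang: str) -> list[str]:
    """Lexies of target_lang that share at least one axie with lexie."""
    prefix = target_lang + ":"
    found = set()
    for axie in AXIES:
        if lexie in axie.lexies:
            found.update(
                other for other in axie.lexies
                if other.startswith(prefix) and other != lexie
            )
    return sorted(found)


print(translations("fra:carte.2", "eng"))
print(translations("fra:carte.1", "jpn"))
```

Because every bilingual link passes through an axie, adding a new language only requires linking its lexies to existing acceptions, not to every other dictionary.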

Page 28

Construct systems where quality can be improved "on demand" by users

• a priori, through interactive disambiguation in the source language

• or a posteriori, by correcting the pivot representation (UNL or other) through any language (as in MultiMeteo)

==> In both cases, all versions (in all languages) are improved

• possibility to merge MT, multilingual generation, and computer-aided authoring
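The claim that one correction improves all versions follows from keeping a single pivot representation and regenerating every language from it. The toy Python sketch below illustrates this under heavy simplification: the pivot is just a list of concept symbols borrowed from the Papillon example, and the per-language `GENERATORS` tables and the `generate_all` function are hypothetical stand-ins for real generators.

```python
# Hypothetical per-language generation tables, keyed by pivot symbols.
GENERATORS = {
    "eng": {"card(icl>play)": "playing card", "map(fld>geography)": "map"},
    "fra": {"card(icl>play)": "carte à jouer",
            "map(fld>geography)": "carte géographique"},
}


def generate_all(pivot):
    """Regenerate every language version from the same pivot."""
    return {
        lang: " ".join(table[symbol] for symbol in pivot)
        for lang, table in GENERATORS.items()
    }


# The analyzer picked the wrong sense: every language version is wrong.
pivot = ["card(icl>play)"]
before = generate_all(pivot)

# A user corrects the pivot once, working through any one language...
pivot[0] = "map(fld>geography)"
# ...and all versions are improved when they are regenerated.
after = generate_all(pivot)

print(before)
print(after)
```

The design point is that the correction is stored in the pivot rather than in any one target text, so no target version has to be post-edited separately.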

Page 29

Conclusion

• 4 keys to open the door to MT of adequate quality for all languages

• On the technical side,
  dramatically increase the use of learning techniques
  use pivot architectures, the most universally usable pivot being UNL

• On the organizational side,
  cooperatively develop open source linguistic resources on the Web
  construct systems where quality can be improved "on demand" by users

• On the practical side, seek keys to unlock private investment, public funding, voluntary cooperation
  could this conference become a decisive turning point?