Language Divergences and Solutions Advanced Machine Translation Seminar Alison Alvarez

Author: nathanael-cowell

Post on 02-Apr-2015

TRANSCRIPT

  • Slide 1

Language Divergences and Solutions
Advanced Machine Translation Seminar
Alison Alvarez

  • Slide 2

Overview
  Introduction
  Morphology Primer
  Translation Mismatches
    Types
    Solutions
  Translation Divergences
    Types
    Solutions
  Different MT Systems
  Generation-Heavy Machine Translation
  DUSTer

  • Slide 3

Source Target
  Languages don't encode the same information in the same way
  Makes MT complicated
  Keeps all of us employed

  • Slide 4

Morphology in a Nutshell
  Morphemes are word parts
    Work +er
    Iki +ta +ku +na +ku +na +ri +ma +shi +ta
  Types of morphemes
    Derivational: makes a new word
    Inflectional: adds information to an existing word

  • Slide 5

Morphology in a Nutshell
  Analytic/Isolating
    Little or no inflectional morphology; separate words
    Vietnamese, Chinese
    I was made to go
  Synthetic
    Lots of inflectional morphology
    Fusional vs. agglutinating
    Romance languages, Finnish, Japanese, Mapudungun
    Ika (to go) +se (to make/let) +rare (passive) +ta (past tense)
    He need +s (3rd person singular) it.

  • Slide 6

Translation Differences: Types
  Translation Mismatches
    Different information from source to target
  Translation Divergences
    Same information from source to target, but the meaning is distributed differently in each language

  • Slide 7

Translation Mismatches
  The information that is conveyed is different in the source and target languages
  Types:
    Lexical level
    Typological level

  • Slide 8

Lexical Mismatches
  A lexical item in one language may have more distinctions than in another
  Brother
    otouto: younger brother
    ani-san: older brother

  • Slide 9

Typological Mismatches
  Mismatch between languages with different levels of grammaticalization
  One language may be more structurally complex
  Source marking, obligatory subject

  • Slide 10

Typological Mismatches
  Source: Quechua vs. English
    (they say) s/he was singing --> takisharansi
    taki (sing) +sha (progressive) +ra (past) +n (3rd sg) +si (reportative)
  Obligatory arguments: English vs. Japanese
    Kusuri wo nonda --> (I, you, etc.) took medicine.
    Makasemasu! --> (I'll) leave (it) to (you)

  • Slide 11

Translation Mismatch Solutions
  More information --> less information (easy)
  Less information --> more information (hard)
    Context clues
    Language models
    Generalization
    Formal representations

  • Slide 12

Translation Divergences
  The same information is conveyed in source and target texts
  Divergences are quite common
    They occur in about one out of every three sentences in the TREC El Norte newspaper corpus (Spanish-English)
  Sentences can have multiple kinds of divergences

  • Slide 13

Translation Divergence Types
  Categorial Divergence
  Conflational Divergence
  Structural Divergence
  Head Swapping Divergence
  Thematic Divergence

  • Slide 14

Categorial Divergence
  Translation that uses different parts of speech
  Tener hambre (have hunger) --> be hungry
  Noun --> adjective

  • Slide 15

Conflational Divergence
  The translation of two words using a single word that combines their meaning
  Can also be called a lexical gap
  X stab Z --> X dar puñaladas a Z (X give stabs to Z)
  glastuinbouw --> cultivation under glass

  • Slide 16

Structural Divergence
  A difference in the realization of incorporated arguments
  PP to object
  X entrar en Y (X enter in Y) --> X enter Y
  X ask for a referendum --> X pedir un referendum (ask-for a referendum)

  • Slide 17

Head Swapping Divergence
  Involves the demotion of a head verb and the promotion of a modifier verb to head position
  Spanish: Yo entro en el cuarto corriendo
  English: I ran into the room.
  [Figure: parse trees for the two sentences; node labels S, NP, VP, N, V, PP]
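
The head-swapping case on Slide 17 can be illustrated with a toy dependency tree. The sketch below is only an illustration under stated assumptions: the `Node` class, the `head_swap` function, the relation labels, and the `PATH_VERB_TO_PREP` table are all hypothetical names invented for this example, and the tree uses English glosses of the Spanish lemmas. It is not taken from any system described in the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    lemma: str                      # English gloss of the lemma
    rel: str                        # relation to parent: "root", "subj", "pp", "obj", "manner"
    children: List["Node"] = field(default_factory=list)


# Hypothetical lookup: path verbs and the English preposition
# that expresses the same path once the verb is demoted.
PATH_VERB_TO_PREP = {"enter": "into", "exit": "out of", "cross": "across"}


def head_swap(root: Node) -> Node:
    """Promote a manner modifier to head and demote the path verb to a preposition."""
    manner = next((c for c in root.children if c.rel == "manner"), None)
    if manner is None or root.lemma not in PATH_VERB_TO_PREP:
        return root  # no head-swapping configuration found
    prep = PATH_VERB_TO_PREP[root.lemma]
    new_children = []
    for child in root.children:
        if child.rel == "manner":
            continue  # the modifier becomes the new head below
        if child.rel == "pp":
            # the demoted verb surfaces as the preposition of the PP
            new_children.append(Node(prep, "pp", child.children))
        else:
            new_children.append(child)
    return Node(manner.lemma, "root", new_children)


def show(node: Node) -> str:
    """Render a tree as head(child, child, ...)."""
    if not node.children:
        return node.lemma
    return f"{node.lemma}(" + ", ".join(show(c) for c in node.children) + ")"


# "Yo entro en el cuarto corriendo", glossed: enter(I, in(room), run)
spanish = Node("enter", "root", [
    Node("I", "subj"),
    Node("in", "pp", [Node("room", "obj")]),
    Node("run", "manner"),
])
english = head_swap(spanish)
print(show(english))  # run(I, into(room))
```

The swap is driven entirely by a lexical table here, which is the simplest possible design; a real system would also have to handle tense, word order, and verbs that are not in the table.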
  • Slide 18

Thematic Divergence
  This divergence occurs when sentence arguments switch argument roles from one language to another
  X gustar a Y (X please to Y) --> Y like X

  • Slide 19

Divergence Solutions and Statistical/EBMT Systems
  Not really addressed explicitly in SMT
  Covered in EBMT only if it is covered extensively in the data

  • Slide 20

Divergence Solutions and Transfer Systems
  Hand-written transfer rules
  Automatic extraction of transfer rules from bi-texts
  Problematic with multiple divergences

  • Slide 21

Divergence Solutions and Interlingua Systems
  Mel'čuk's Deep Syntactic Structure
  Jackendoff's Lexical Semantic Structure
  Both require explicit symmetric knowledge from both source and target language
  Expensive

  • Slide 22

Divergence Solutions and Interlingua Systems
  John swam across a river
  Juan cruza el río nadando
  [event CAUSE JOHN
    [event GO JOHN
      [path ACROSS JOHN
        [position AT JOHN RIVER]]]
    [manner SWIM+INGLY]]

  • Slide 23

Generation-Heavy MT
  Built to address language divergences
  Designed for source-poor/target-rich translation
  Non-interlingual
  Non-transfer
  Uses symbolic overgeneration to account for different translation divergences

  • Slide 24

Generation-Heavy MT
  Source language
    Syntactic parser
    Translation lexicon
  Target language
    Lexical semantics, categorial variations & subcategorization frames for overgeneration
    Statistical language model

  • Slide 25

GHMT System

  • Slide 26

Analysis Stage
  Independent of target language
  Creates a deep syntactic dependency
  Only argument structure, top-level conceptual nodes & thematic-role information
  Should normalize over syntactic & morphological phenomena

  • Slide 27

Translation Stage
  Converts SL lexemes to TL lexemes
  Maintains dependency structure

  • Slide 28

Analysis/Translation Stage
  GIVE (v) [cause go]
    I: agent
    STAB (n): theme
    JOHN: goal

  • Slide 29

Generation Stage
  Lexical & Structural Selection
    Conversion to a thematic dependency
      Uses syntactic-thematic linking map
      Loose linking
    Structural expansion
      Addresses conflation & head-swapped divergences
    Turn thematic dependency to TL syntactic dependency
      Addresses categorial divergence

  • Slide 30

Generation Stage: Structural Expansion

  • Slide 31

Generation Stage
  Linearization Step
    Creates a word lattice to encode different possible realizations
    Implemented using oxyGen engine
  Sentences ranked & extracted
    Nitrogen's statistical extractor

  • Slide 32

Generation Stage

  • Slide 33

GHMT Results
  4 of 5 Spanish-English divergences can be generated using structural expansion & categorial variations
  The remaining 1 out of 5 needed more world knowledge or idiom handling
  SL syntactic parser can still be hard to come by

  • Slide 34

Divergences and DUSTer
  Helps to overcome divergences for word alignment & improve coder agreement
  Changes an English sentence structure to resemble another language
  More accurate alignment and projection of dependency trees without training on dependency tree data

  • Slide 35

DUSTer
  Motivation for the development of automatic correction of divergences
  1. Every language pair has translation divergences that are easy to recognize
  2. Knowing what they are and how to accommodate them provides the basis for refined word-level alignment
  3. Refined word-level alignment results in improved projection of structural information from English to another language

  • Slide 36

DUSTer

  • Slide 37

  Bi-text parsed on English side only
  Linguistically motivated & common search terms
  Conducted on Spanish & Arabic (and later Chinese & Hindi)
  Uses all of the divergences mentioned before, plus a light verb divergence
    try --> put to trying --> poner a prueba

  • Slide 38

DUSTer Rule Development Methods
  Identify canonical transformations for each divergence type
  Categorize English sentences into divergence type or none
  Apply appropriate transformations
  Humans align E, E' and the foreign language

  • Slide 39

DUSTer Rules
  # "kill" => "LightVB kill(N)" (LightVB = light verb)
  # Presumably, this will work for "kill" => "give death to"
  # "borrow" => "take lent (thing) to"
  # "hurt" => "make harm to"
  # "fear" => "have fear of"
  # "desire" => "have interest in"
  # "rest" => "have repose on"
  # "envy" => "have envy of"
  type1.B.X [English{2 1 3} Spanish{2 1 3 4 5} ]
  [ Verb [ Noun ] [ Noun ] ]
  [ LightVB [ Noun ] [ Noun ] [ Oblique [ Noun ] ] ]

  • Slide 40

DUSTer Results

  • Slide 41

Conclusion
  Divergences are common
  They are not handled well by most MT systems
  GHMT can account for divergences, but still needs development
  DUSTer can handle divergences through structure transformations, but requires a great deal of linguistic knowledge

  • Slide 42

The End
  Questions?

  • Slide 43

References
  Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and Proposed Solution," Computational Linguistics, 20:4, pp. 597--633, 1994.
  Dorr, Bonnie J. and Nizar Habash, "Interlingua Approximation: A Generation-Heavy Approach," Proceedings of the Workshop on Interlingua Reliability, Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 1--6, 2002.
  Dorr, Bonnie J., Clare R. Voss, Eric Peterson, and Michael Kiker, "Concept Based Lexical Selection," Proceedings of the AAAI-94 Fall Symposium on Knowledge Representation for Natural Language Processing in Implemented Systems, New Orleans, LA, pp. 21--30, 1994.
  Dorr, Bonnie J., Lisa Pearl, Rebecca Hwa, and Nizar Habash, "DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment," Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 31--43, 2002.
  Habash, Nizar and Bonnie J. Dorr, "Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation," Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 84--93, 2002.
  Haspelmath, Martin, Understanding Morphology, Oxford University Press, 2002.
  Kameyama, Megumi, Ryo Ochitani, and Stanley Peters, "Resolving Translation Mismatches With Information Flow," Annual Meeting of the Association of Computational Linguistics, 1991.

  • Slide 44

Other Divergences
  Idioms
  Aspectual Divergences
  Knowledge outside of Lexical Semantics
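
As a rough illustration of the light-verb rule on Slide 39, the sketch below rewrites an English (subject, verb, object) triple into the five-slot shape "Noun LightVB Noun Oblique Noun" that the rule gives for the Spanish side. The paraphrases in the table come from the rule's comments; the function name `expand_light_verb`, the triple representation, and the choice to leave unknown verbs unchanged are assumptions made for this example, and the code is not DUSTer's actual implementation, which operates on parsed sentences.

```python
# Light-verb paraphrases from the comments of the Slide 39 rule:
# verb -> (light verb, nominalized verb, oblique marker)
LIGHT_VERB = {
    "kill": ("give", "death", "to"),
    "hurt": ("make", "harm", "to"),
    "fear": ("have", "fear", "of"),
    "desire": ("have", "interest", "in"),
    "envy": ("have", "envy", "of"),
}


def expand_light_verb(subject, verb, obj):
    """Rewrite [Noun Verb Noun] as [Noun LightVB Noun Oblique Noun] when a paraphrase is known.

    Hypothetical helper: verbs missing from the table are returned unchanged.
    """
    if verb not in LIGHT_VERB:
        return [subject, verb, obj]
    light, noun, oblique = LIGHT_VERB[verb]
    return [subject, light, noun, oblique, obj]


print(expand_light_verb("John", "kill", "Mary"))
# ['John', 'give', 'death', 'to', 'Mary']
```

The restructured English sequence has one slot per word expected on the Spanish side, which is what makes a one-to-one word alignment easier to draw.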