Page 1: Language Divergences and Solutions

Language Divergences and Solutions

Advanced Machine Translation Seminar

Alison Alvarez

Page 2: Language Divergences and Solutions

Overview

Introduction
Morphology Primer
Translation Mismatches
  Types
  Solutions
Translation Divergences
  Types
  Solutions
Different MT Systems
Generation Heavy Machine Translation
DUSTer

Page 3: Language Divergences and Solutions

Source ≠ Target

Languages don’t encode the same information in the same way
Makes MT complicated
Keeps all of us employed

Page 4: Language Divergences and Solutions

Morphology in a Nutshell

Morphemes are word parts
  Work +er
  Iki +ta +ku +na +ku +na +ri +ma +shi +ta

Types of Morphemes
  Derivational: makes new word
  Inflectional: adds information to an existing word

Page 5: Language Divergences and Solutions

Morphology in a Nutshell

Analytic/Isolating
  Little or no inflectional morphology, separate words
  Vietnamese, Chinese
  I was made to go

Synthetic
  Lots of inflectional morphology
  Fusional vs. Agglutinating
  Romance Languages, Finnish, Japanese, Mapudungun
  Ika (to go) +se (to make/let) +rare (passive) +ta (past tense)
  He need +s (3rd person singular) it.
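The Japanese example above can be written down as a list of (morpheme, gloss) pairs. This is only a minimal sketch: the function names are hypothetical, and plain concatenation is an assumption that holds for this agglutinating example but not for fusional morphology, where one affix carries several features at once.

```python
# A word as a list of (morpheme, gloss) pairs, copied from the slide's example.
ikaserareta = [
    ("ika", "to go"),
    ("se", "to make/let"),
    ("rare", "passive"),
    ("ta", "past tense"),
]

def surface(morphemes):
    """Concatenate the morphemes into the surface word."""
    return "".join(m for m, _ in morphemes)

def gloss(morphemes):
    """Render a '+'-separated gloss line in the slide's notation."""
    return " +".join(f"{m} ({g})" for m, g in morphemes)

print(surface(ikaserareta))
print(gloss(ikaserareta))
```

The single Japanese word carries what analytic English spreads over five separate words ("I was made to go").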

Page 6: Language Divergences and Solutions

Translation Differences

Types

Translation Mismatches
  Different information from source to target

Translation Divergences
  Same information from source to target, but the meaning is distributed differently in each language

Page 7: Language Divergences and Solutions

Translation Mismatches

“…the information that is conveyed is different in the source and target languages”

Types:
  Lexical level
  Typological level

Page 8: Language Divergences and Solutions

Lexical Mismatches

A lexical item in one language may have more distinctions than in another

Brother
  otouto: Younger Brother
  兄さん (nii-san): Older Brother

Page 9: Language Divergences and Solutions

Typological Mismatches

Mismatch between languages with different levels of grammaticalization

One language may be more structurally complex

Source marking, Obligatory Subject

Page 10: Language Divergences and Solutions

Typological Mismatches

Source: Quechua vs. English
  (they say) s/he was singing --> takisharansi
  taki (sing) +sha (progressive) +ra (past) +n (3rd sg) +si (reportative)

Obligatory Arguments: English vs. Japanese
  Kusuri wo Nonda --> (I, you, etc.) took medicine.
  Makasemasu! --> (I’ll) leave (it) to (you)

Page 11: Language Divergences and Solutions

Translation Mismatch Solutions

More information --> Less information (easy)
Less information --> More information (hard)
  Context clues
  Language Models
  Generalization
  Formal representations
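The asymmetry between the two directions can be illustrated with the "brother" example from the lexical-mismatch slide. This is a toy sketch, not any system's lexicon format: the table layout and the `relative_age` feature name are assumptions made for illustration.

```python
# Toy lexicon built from the slide: English "brother" maps to two Japanese
# words that differ in relative age. The layout is hypothetical.
EN_TO_JA = {
    "brother": {"younger": "otouto", "older": "兄さん"},
}

def translate_en_to_ja(word, relative_age=None):
    """Less -> more information: without extra context, several candidates remain."""
    options = EN_TO_JA[word]
    if relative_age is None:
        return sorted(options.values())
    return [options[relative_age]]

def translate_ja_to_en(word):
    """More -> less information: the age distinction is simply dropped."""
    for english, options in EN_TO_JA.items():
        if word in options.values():
            return english
    raise KeyError(word)

print(translate_ja_to_en("otouto"))
print(translate_en_to_ja("brother"))
print(translate_en_to_ja("brother", relative_age="older"))
```

The `relative_age` argument stands in for whatever the context clues, language model, or formal representation would have to supply.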

Page 12: Language Divergences and Solutions

Translation Divergences

“…the same information is conveyed in source and target texts”

Divergences are quite common
  They occur in about one out of every three sentences in the TREC El Norte Newspaper corpus (Spanish-English)

Sentences can have multiple kinds of divergences

Page 13: Language Divergences and Solutions

Translation Divergence Types

Categorial Divergence
Conflational Divergence
Structural Divergence
Head Swapping Divergence
Thematic Divergence

Page 14: Language Divergences and Solutions

Categorial Divergence

Translation that uses different parts of speech

Tener hambre (have hunger) --> be hungry

Noun --> adjective

Page 15: Language Divergences and Solutions

Conflational Divergence

The translation of two words using a single word that combines their meaning

Can also be called a lexical gap
  X stab Z --> X dar puñaladas a Z (X give stabs to Z)
  glastuinbouw --> cultivation under glass

Page 16: Language Divergences and Solutions

Structural Divergence

A difference in the realization of incorporated arguments

PP to Object
  X entrar en Y (X enter in Y) --> X enter Y
  X ask for a referendum --> X pedir un referendum (ask-for a referendum)

Page 17: Language Divergences and Solutions

Head Swapping Divergence

Involves the demotion of a head verb and the promotion of a modifier verb to head position

Spanish: Yo entro en el cuarto corriendo
[S [NP [N Yo]] [VP [V entro] [PP en el cuarto] [VP corriendo]]]

English: I ran into the room.
[S [NP [N I]] [VP [V ran] [PP into the room]]]

Page 18: Language Divergences and Solutions

Thematic Divergence

This divergence occurs when sentence arguments switch argument roles from one language to another

X gustar a Y (X please to Y) --> Y like X
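For quick reference, the five divergence types and the examples given on the preceding slides can be collected in one small table. This is only a summary sketch; the dictionary name and layout are hypothetical.

```python
# Divergence type -> (source pattern, target pattern), copied from the slides.
DIVERGENCE_EXAMPLES = {
    "categorial": ("tener hambre (have hunger)", "be hungry"),
    "conflational": ("X stab Z", "X dar puñaladas a Z (X give stabs to Z)"),
    "structural": ("X entrar en Y (X enter in Y)", "X enter Y"),
    "head swapping": ("Yo entro en el cuarto corriendo", "I ran into the room"),
    "thematic": ("X gustar a Y (X please to Y)", "Y like X"),
}

for name, (source, target) in DIVERGENCE_EXAMPLES.items():
    print(f"{name}: {source} --> {target}")
```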

Page 19: Language Divergences and Solutions

Divergence Solutions and Statistical/EBMT Systems

Not really addressed explicitly in SMT
Covered in EBMT only if it is covered extensively in the data

Page 20: Language Divergences and Solutions

Divergence Solutions and Transfer Systems

Hand-written transfer rules
Automatic extraction of transfer rules from bi-texts
Problematic with multiple divergences

Page 21: Language Divergences and Solutions

Divergence Solutions and Interlingua Systems

Mel’čuk’s Deep Syntactic Structure
Jackendoff’s Lexical Semantic Structure
Both require “explicit symmetric knowledge” from both source and target language
Expensive

Page 22: Language Divergences and Solutions

Divergence Solutions and Interlingua Systems

John swam across a river

Juan cruza el río nadando

[event CAUSE JOHN
  [event GO JOHN
    [path ACROSS JOHN
      [position AT JOHN RIVER]]]
  [manner SWIM+INGLY]]
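The bracketed structure on this slide can be held as nested data. The sketch below uses plain Python tuples of the form (type, primitive, arguments...); that encoding and the helper name are assumptions for illustration, not the representation used in any of the cited systems.

```python
# The slide's structure as nested tuples: (type, primitive, *arguments).
LCS = (
    "event", "CAUSE", "JOHN",
    ("event", "GO", "JOHN",
        ("path", "ACROSS", "JOHN",
            ("position", "AT", "JOHN", "RIVER"))),
    ("manner", "SWIM+INGLY"),
)

def primitives(node):
    """Collect the primitive of every typed node, depth-first."""
    found = [node[1]]
    for child in node[2:]:
        if isinstance(child, tuple):
            found.extend(primitives(child))
    return found

print(primitives(LCS))
```

Both sentences on the slide map to this one structure. English packs the motion and its manner into "swam" and leaves the path to "across", while Spanish puts the path into "cruza" and the manner into "nadando".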

Page 23: Language Divergences and Solutions

Generation-Heavy MT

Built to address language divergences
Designed for source-poor/target-rich translation
Non-Interlingual
Non-Transfer
Uses symbolic overgeneration to account for different translation divergences

Page 24: Language Divergences and Solutions

Generation-Heavy MT

Source language
  syntactic parser
  translation lexicon

Target language
  lexical semantics, categorial variations & subcategorization frames for overgeneration
  Statistical language model

Page 25: Language Divergences and Solutions

GHMT System

Page 26: Language Divergences and Solutions

Analysis Stage

Independent of Target Language
Creates a deep syntactic dependency
  Only argument structure, top-level conceptual nodes & thematic-role information
Should normalize over syntactic & morphological phenomena

Page 27: Language Divergences and Solutions

Translation Stage

Converts SL lexemes to TL lexemes
Maintains dependency structure

Page 28: Language Divergences and Solutions

Analysis/Translation Stage

GIVE (v) [cause go]
  I: agent
  STAB (n): theme
  JOHN: goal
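The dependency structure on this slide has a head with a part of speech and a conceptual label, plus children tagged with thematic roles. A minimal way to hold that in code is sketched below; the class and field names are hypothetical and are not taken from the GHMT implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lexeme: str
    pos: str = ""        # part of speech, e.g. "v" or "n"
    role: str = ""       # thematic role relative to the parent
    concept: str = ""    # top-level conceptual label, e.g. "cause go"
    children: list = field(default_factory=list)

# The tree from the slide.
tree = Node("GIVE", pos="v", concept="cause go", children=[
    Node("I", role="agent"),
    Node("STAB", pos="n", role="theme"),
    Node("JOHN", role="goal"),
])

def roles(node):
    """Map each child's thematic role to its lexeme."""
    return {child.role: child.lexeme for child in node.children}

print(roles(tree))
```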

Page 29: Language Divergences and Solutions

Generation Stage

Lexical & Structural Selection
  Conversion to a thematic dependency
    Uses syntactic-thematic linking map
    “loose” linking
  Structural expansion
    Addresses conflation & head-swapped divergences
  Turn thematic dependency to TL syntactic dependency
    Addresses categorial divergence
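One way to picture structural expansion for the conflation case is to overgenerate: keep the literal structure and add a variant in which a light head verb is absorbed by its nominal theme. The sketch below is a strong simplification under stated assumptions. The `CATVAR` table, the `LIGHT_VERBS` set, and the rule that the goal becomes the new theme are all invented for illustration and are not the GHMT algorithm itself.

```python
# Toy categorial-variation table (hypothetical stand-in for a real resource).
CATVAR = {("STAB", "n"): ("STAB", "v")}
# Assumption: heads that may be absorbed by their nominal theme.
LIGHT_VERBS = {"GIVE"}

def expand(head, args):
    """Return the original structure plus a conflated variant when one exists.

    head: (lexeme, pos); args: dict mapping role -> (lexeme, pos).
    """
    candidates = [(head, args)]
    theme = args.get("theme")
    if head[0] in LIGHT_VERBS and theme in CATVAR and "goal" in args:
        new_head = CATVAR[theme]  # the noun's verbal variant becomes the head
        new_args = {"agent": args["agent"], "theme": args["goal"]}
        candidates.append((new_head, new_args))
    return candidates

give = (
    ("GIVE", "v"),
    {"agent": ("I", "n"), "theme": ("STAB", "n"), "goal": ("JOHN", "n")},
)
for head, args in expand(*give):
    print(head[0], {role: arg[0] for role, arg in args.items()})
```

Both candidates are kept at this point; choosing between "give stabs to John" and "stab John" is left to the later ranking step.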

Page 30: Language Divergences and Solutions

Generation Stage: Structural Expansion

Page 31: Language Divergences and Solutions

Generation Stage

Linearization Step
  Creates a word lattice to encode different possible realizations
  Implemented using oxyGen engine

Sentences ranked & extracted
  Nitrogen’s statistical extractor
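The lattice-then-rank idea can be sketched in a few lines. This is a toy and does not use the oxyGen or Nitrogen interfaces: the lattice is simplified to a sequence of slots with alternatives, and every bigram log-probability below is an invented number.

```python
from itertools import product

# Simplified lattice: one slot per position, each with alternative realizations.
LATTICE = [["I"], ["stabbed", "gave stabs to"], ["John"]]

# Toy bigram log-probabilities; all values are invented for illustration.
BIGRAM_LOGP = {
    ("<s>", "I"): -0.5, ("I", "stabbed"): -2.0, ("stabbed", "John"): -1.5,
    ("I", "gave"): -1.8, ("gave", "stabs"): -6.0, ("stabs", "to"): -3.0,
    ("to", "John"): -2.0, ("John", "</s>"): -0.7,
}
UNSEEN = -10.0  # flat penalty for bigrams missing from the table

def score(sentence):
    """Sum bigram log-probabilities over the sentence with boundary markers."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(BIGRAM_LOGP.get(pair, UNSEEN) for pair in zip(tokens, tokens[1:]))

def rank(lattice):
    """Enumerate every path through the lattice and sort by model score."""
    sentences = [" ".join(path) for path in product(*lattice)]
    return sorted(sentences, key=score, reverse=True)

for sentence in rank(LATTICE):
    print(round(score(sentence), 2), sentence)
```

Enumerating every path is only workable for a toy lattice this small; a real extractor would search the lattice instead of listing all realizations.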

Page 32: Language Divergences and Solutions

Generation Stage

Page 33: Language Divergences and Solutions

GHMT Results

4 of 5 Spanish-English divergences “can be generated using structural expansion & categorial variations”

The remaining 1 out of 5 needed more world knowledge or idiom handling

SL syntactic parser can still be hard to come by

Page 34: Language Divergences and Solutions

Divergences and DUSTer

Helps to overcome divergences for word alignment & improve coder agreement

Changes an English sentence structure to resemble another language

More accurate alignment and projection of dependency trees without training on dependency tree data

Page 35: Language Divergences and Solutions

DUSTer

Motivation for the development of automatic correction of divergences

1. “Every Language Pair has translation divergences that are easy to recognize”

2. “Knowing what they are and how to accommodate them provides the basis for refined word level alignment”

3. “Refined word-level alignment results in improved projection of structural information from English to another language”

Page 36: Language Divergences and Solutions

DUSTer

Page 37: Language Divergences and Solutions

DUSTer

Bi-text parsed on English side only
“Linguistically Motivated” & common search terms
Conducted on Spanish & Arabic (and later Chinese & Hindi)
Uses all of the divergences mentioned before, plus a “light verb” divergence
  try --> put to trying (poner a prueba)

Page 38: Language Divergences and Solutions

DUSTer Rule Development Methods

Identify canonical transformations for each divergence type
Categorize English sentences into divergence type or “none”
Apply appropriate transformations
Humans align E, E’ and foreign language

Page 39: Language Divergences and Solutions

DUSTer Rules

# "kill" => "LightVB kill(N)" (LightVB = light verb)
# Presumably, this will work for "kill" => "give death to"
# "borrow" => "take lent (thing) to"
# "hurt" => "make harm to"
# "fear" => "have fear of"
# "desire" => "have interest in"
# "rest" => "have repose on"
# "envy" => "have envy of"
type1.B.X [English{2 1 3} Spanish{2 1 3 4 5} ]
[ Verb<1,i,CatVar:V_N> [ Noun<2,j,Subj> ] [ Noun<3,k,Obj> ] ]
<-->
[ LightVB<1,Verb> [ Noun<2,j,Subj> ] [ Noun<3,i,Obj> ]
  [ Oblique<4,Pred,Prep> [ Noun<5,k,PObj> ] ] ]
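Read left to right, the rule rewrites a subject-verb-object clause so that a light verb heads it, the original verb appears as a noun object, and the original object moves into a prepositional phrase. The sketch below mimics that effect on flat word lists. It is a hedged illustration: the function name and table layout are hypothetical, it ignores inflection, and it does not parse or execute the DUSTer rule notation.

```python
# Paraphrases copied from the rule's comments: verb -> (light verb, noun, preposition).
LIGHT_VERB_FORMS = {
    "hurt": ("make", "harm", "to"),
    "fear": ("have", "fear", "of"),
    "desire": ("have", "interest", "in"),
    "rest": ("have", "repose", "on"),
    "envy": ("have", "envy", "of"),
}

def to_e_prime(subj, verb, obj):
    """Rewrite [subj verb obj] as [subj light-verb noun prep obj], else leave it unchanged."""
    if verb not in LIGHT_VERB_FORMS:
        return [subj, verb, obj]
    light, noun, prep = LIGHT_VERB_FORMS[verb]
    return [subj, light, noun, prep, obj]

print(to_e_prime("John", "fear", "dogs"))
print(to_e_prime("John", "see", "dogs"))
```

The five-word output follows the Spanish{2 1 3 4 5} ordering in the rule: subject, light verb, nominalized verb, preposition, prepositional object.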

Page 40: Language Divergences and Solutions

DUSTer Results

Page 41: Language Divergences and Solutions

Conclusion

Divergences are common
They are not handled well by most MT systems
GHMT can account for divergences, but still needs development
DUSTer can handle divergences through structure transformations, but requires a great deal of linguistic knowledge

Page 42: Language Divergences and Solutions

The End

Questions?

Page 43: Language Divergences and Solutions

References

Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and Proposed Solution," Computational Linguistics, 20:4, pp. 597--633, 1994.

Dorr, Bonnie J. and Nizar Habash, "Interlingua Approximation: A Generation-Heavy Approach," In Proceedings of Workshop on Interlingua Reliability, Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 1--6, 2002.

Dorr, Bonnie J., Clare R. Voss, Eric Peterson, and Michael Kiker, "Concept Based Lexical Selection," Proceedings of the AAAI-94 fall symposium on Knowledge Representation for Natural Language Processing in Implemented Systems, New Orleans, LA, pp. 21--30, 1994.

Dorr, Bonnie J., Lisa Pearl, Rebecca Hwa, and Nizar Habash, "DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment," Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 31--43, 2002.

Habash, Nizar and Bonnie J. Dorr, "Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation," In Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 84--93, 2002.

Haspelmath, Martin, Understanding Morphology, Oxford University Press, 2002.

Kameyama, Megumi, Ryo Ochitani, and Stanley Peters, "Resolving Translation Mismatches With Information Flow," Annual Meeting of the Association for Computational Linguistics, 1991.