cs712: interlingua based mt - iit bombay interlingua based mt ... (marathi-hindi-english: case...

124
CS712: Interlingua Based MT Pushpak Bhattacharyya CSE Department IIT Bombay 28/1/13 1

Upload: ngominh

Post on 18-Apr-2018

230 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

CS712 Interlingua Based MT

Pushpak BhattacharyyaCSE Department

IIT Bombay28113

1

Empiricism vs Rationalism

bull Ken Church ldquoA Pendulum Swung too Farrdquo LILT 2011ndash Availability of huge amount of data what to do

with itndash 1950s Empiricism (Shannon Skinner Firth

Harris)ndash 1970s Rationalism (Chomsky Minsky)ndash 1990s Empiricism (IBM Speech Group AT amp

T)ndash 2010s Return of Rationalism

Introductionbull Machine Translation (MT) is a technique to translate

texts from one natural language to another natural language using a machine

bull Translated text should have two desired propertiesndash Adequacy Meaning should be conveyed correctlyndash Fluency Text should be fluent in the target

languagebull Translation between distant languages is a difficult

taskndash Handling Language Divergence is a major

challenge

3

Kinds of MT Systems(point of entry from source to the target text)

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic level Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-argument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel description

Semi-direct translation

4

Why is MT difficult Language Divergence

bull One of the main complexities of MT Language Divergence

bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence

5

English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

6

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon

7

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)

8

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

9

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 2: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Empiricism vs Rationalism

bull Ken Church ldquoA Pendulum Swung too Farrdquo LILT 2011ndash Availability of huge amount of data what to do

with itndash 1950s Empiricism (Shannon Skinner Firth

Harris)ndash 1970s Rationalism (Chomsky Minsky)ndash 1990s Empiricism (IBM Speech Group AT amp

T)ndash 2010s Return of Rationalism

Introductionbull Machine Translation (MT) is a technique to translate

texts from one natural language to another natural language using a machine

bull Translated text should have two desired propertiesndash Adequacy Meaning should be conveyed correctlyndash Fluency Text should be fluent in the target

languagebull Translation between distant languages is a difficult

taskndash Handling Language Divergence is a major

challenge

3

Kinds of MT Systems(point of entry from source to the target text)

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic level Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-argument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel description

Semi-direct translation

4

Why is MT difficult Language Divergence

bull One of the main complexities of MT Language Divergence

bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence

5

English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

6

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon

7

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)

8

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

9

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 3: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Introductionbull Machine Translation (MT) is a technique to translate

texts from one natural language to another natural language using a machine

bull Translated text should have two desired propertiesndash Adequacy Meaning should be conveyed correctlyndash Fluency Text should be fluent in the target

languagebull Translation between distant languages is a difficult

taskndash Handling Language Divergence is a major

challenge

3

Kinds of MT Systems(point of entry from source to the target text)

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic level Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-argument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel description

Semi-direct translation

4

Why is MT difficult Language Divergence

bull One of the main complexities of MT Language Divergence

bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence

5

English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

6

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon

7

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)

8

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

9

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 4: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Kinds of MT Systems(point of entry from source to the target text)

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic level Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-argument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel description

Semi-direct translation

4

Why is MT difficult Language Divergence

bull One of the main complexities of MT Language Divergence

bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence

5

English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

6

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon

7

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)

8

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

9

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 5: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Why is MT difficult Language Divergence

bull One of the main complexities of MT Language Divergence

bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence

5

English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

6

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon

7

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)

8

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

9

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 6: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

6

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon

7

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)

8

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

9

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 7: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) NounIncorporation- very common Indian LanguagePhenomenon

7

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)

8

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

9

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 8: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip(India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aapjaanaa chaahate ho (who withhellip)

8

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

9

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 9: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergencendash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

9

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 10: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull स न हत भत (immediate past)ndash कधी आलास हा यतो इतकाच (M)ndash कब आय बस अभी आया (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull नःसशय भ व य (certainty in future)ndash आता तो मार खातो खास (M)ndash अब वह मार खायगा ह (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आ वासन (assurance)ndash मी त हाला उ या भटतो (M)ndash म आप स कल मलता ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

10

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 11: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Interlingua Based MT

11

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 12: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

12

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 13: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

IL based MT schematic

13

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 14: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

14

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 15: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

15

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 16: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

16

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 17: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules 17

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 18: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

18

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 19: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Architecture of KBMT

19

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 20: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Digression language typology

20

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 21: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety 22

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 22: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

23

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 23: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estasen la ĉielovia nomo estusanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tielankaŭ sur la tero4 Nian panon ĉiutagandonu al ni hodiaŭ5 Kaj pardonu al niniajn ŝuldojnkiel ankaŭ ni pardonasal niaj ŝuldantoj6 Kaj ne konduku ninen tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine siasanctificate2 que tu regno veni3 que tu voluntate siafacitecomo in le celo etiamsuper le terra4 Da nos hodie nostrepan quotidian5 e pardona a nosnostre debitascomo etiam nos los pardona a nostredebitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomentuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobishodie5 et dimitte nobis debitanostrasicut et nos dimittimusdebitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 24: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san(health) ul (person) ej (place) o (noun)

25

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 25: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

26

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 26: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

27

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 27: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

28

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 28: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

29

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 29: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or

noun+genitive (the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

30

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 30: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

31

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 31: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

32

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 32: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -tandash bottle-NOM cave-from float-GER exit-PAST

33

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 33: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

A look at annotation

34

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 34: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Definition etc (12)(Eduard Hovy ACL 2010 tutorial on annotation)

bull Annotation (lsquotaggingrsquo) is the process of adding new information into raw data by humans annotators

bull Usually the information is added by many small individual decisions in many places throughout the data

bull The addition process usually requires some sort of mental decision that depends both on the raw data and on some theory or knowledge that the annotator has internalized earlier

35

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 35: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Definition etc (12)

bull Typical annotation steps ndash Decide which fragment of the data to annotate ndash Add to that fragment a specific bit of

informationndash chosen from a fixed set of options

36

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 36: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Example of annotation sense marking

एक_4187 नए शोध_1138 क अनसार_3123 िजन लोग _1189 का सामािजक_43540 जीवन_125623 य त_48029 होता ह उनक दमाग_16168 क एक_4187 ह स_120425 म अ धक_42403 जगह_113368 होती ह

(According to a new research those people who have a busy social life have larger space in a part of their brain)

नचर यरोसाइस म छप एक_4187 शोध_1138 क अनसार_3123 कई_4118 लोग _1189 क दमाग_16168 क कन स पता_11431 चला क दमाग_16168 का एक_4187 ह सा_120425 ए मगडाला सामािजक_43540 य तताओ_1438 क साथ_328602 सामज य_166 क लए थोड़ा_38861 बढ़_25368 जाता ह यह शोध_1138 58 लोग _1189 पर कया गया िजसम उनक उ _13159 और दमाग_16168 क साइज़ क आकड़_128065 लए गए अमर क _413405 ट म_14077 न पाया_227806 क िजन लोग _1189 क सोशल नटव कग अ धक_42403 ह उनक दमाग_16168 का ए मगडाला वाला ह सा_120425 बाक _130137 लोग _1189 क तलना_म_38220 अ धक_42403 बड़ा_426602 ह दमाग_16168 का ए मगडाला वाला ह सा_120425 भावनाओ_1912 और मान सक_42151 ि थ त_1652 स जड़ा हआ माना_212436 जाता ह

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 37: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Ambiguity of लोग (People)bull लोग जन लोक जनमानस पि लक - एक स अ धक

यि त लोग क हत म काम करना चा हए ndash (English synset)

multitude masses mass hoi_polloi people the_great_unwashed - the common people generally separate the warriors from the mass power to the people

bull द नया द नया ससार व व जगत जहा जहान ज़माना जमाना लोक द नयावाल द नयावाल लोग - ससार म रहन वाल लोग महा मा गाधी का स मान पर द नया करती ह म इस द नया क परवाह नह करता आज क द नया पस क पीछ भाग रह ह ndash (English synset) populace public world - people in

general considered as a whole he is a hero in the eyes of the publicrdquo

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 38: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Sense Marked corpora in Marathi

Snapshot of a Marathi sense tagged paragraph

39

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 39: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

A good annotator is rare to find

bull An annotator has to understand BOTH language phenomena and the data

bull An annotation designer has to understand BOTH linguistics and statistics

40

Linguistics and Language phenomena

Data and statistical phenomena

Annotator

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 40: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

The completeness chain

bull MT is NLP-completebull NLP is AI-completebull AI id CS-complete

41

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 41: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Penn tagset (12)

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 42: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Penn tagset (22)

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 43: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Indian Language Tagset Noun

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 44: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Indian Language Tagset Verb Adjective Adverb

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 45: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Indian Language Tagset Particle

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 46: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Scale of effort involved in annotation Penn Treebank (12)

bull Wall Street Journal subcorpus of the PTB 2 release

bull 8 people working half time each about a year

bull The same group had previously worked on annotation guidelines and followedyet another year of an initial annotation of the WSJ corpus that was redonefor the PTB 2 version

47

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 47: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Scale of effort involved in annotation Penn Treebank (22)

bull The same group also ndash annotated most of the original Brown corpus with PTB

2 style treesndash 1 million words of the Switchboard corpus for total of

well over 3 million wordsndash POS tagged 3 million words of a

much larger WSJ corpusbull All in all the effort was over about 5 years and had

about 4-5 full time employeesbull Total effort 8 million words 20-25 man years

48

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 48: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Annotation effort Ontonotesbull Annotate 300K words per year

ndash comprising various genres of text (news conversational telephone speech weblogs usenet newsgroups broadcast talk shows)

ndash in three languages (English Chinese and Arabic) ndash with structural information (syntax and predicate argument

structure) and ndash shallow semantics (word sense linked to an ontology and

coreference)

bull 1 person (Dr Ann Taylor senior lecturer in YorUniv) had managed the treebank project for a year in the last stages of the project

bull Ontonotes annotation guidelines were frozen a priori 49

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 49: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Annotation effort Prague Discourse Treebank (12)

bull Phase 1 autumn 2007 through September 2009 collecting information getting acquainted with the annotation scheme of the Pennsylvania Discourse Treebank development of the annotation scenario (a team of three or four people)

bull Phase 2 from September 2009 till January 2010 5 (part time) annotators testing our scheme

50

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 50: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Annotation effort Prague Discourse Treebank (22)

bull Phase 3 from January 2010 till May 2012 5 annotators part time annotating the PDT (about 50000 sentences in more than 3000 documents)

bull Phase 4 from May 2012 till October 2012 testing for consistency corrections etc

51

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 51: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Back to Interlingua

52

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 52: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

53

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 53: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 54: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

UNL a United Nations projectbull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

55

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 55: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

56

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 56: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations

and84 attributelabels

57

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 57: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Sentence embeddings

Deepa claimed that she had composed a poem[UNL]

agt(claimentrypast Deepa)obj(claimentrypast 01)agt01(composepastentrycomplete she)obj01(composepastentrycomplete

poemindef)[UNL]

58

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 58: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

59

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 59: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

60

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 60: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

61

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 61: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan RajitaMany publications

62

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 62: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 63: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 64: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined

Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute

generation are formed on Stanford Dependency relations

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 65: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Text Simplifierbull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 66: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

UNL Mergingbull Merger takes multiple UNLs and merges them to form one

UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 67: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 68: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

System Processing UnitsSyntactic

Processing

bull NER

bull Stanford Dependency Parser

bull XLE Parser

Semantic Processing

bull Stems of wordsbull WSDbull Noun originating

from verbsbull Feature

Generationbull Noun featuresbull Verb features

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 69: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

SYNTACTIC PROCESSING

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 70: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Syntactic Processing NERbull NER tags the named entities in the sentence with

their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 71: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 72: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 73: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the riverInput

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 74: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 75: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas beeating eatapple applefork forkbank bankriver river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 76: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 77: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)eating eat(iclgtconsume equgtconsumption)apple apple(iclgtedible fruitgtthing)fork fork(iclgtcutlerygtthing)bank bank(iclgtslopegtthing)river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 78: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2)instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 79: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Features of UWsbull Each UNL relation requires some properties of their

UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Propertyagt(uw1 uw2) uw1 Volitional verbmet(uw1 uw2) uw2 Abstract thingdur(uw1 uw2) uw2 Time periodins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 80: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun FeatureWill-Smith PERSON ANIMATEbank PLACEfork CONCRETE

Algorithm

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 81: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Types of verbsbull Verb features are based on the types of

verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or

just happen without any performerndash Be Verbs which tells about existence of

something or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 82: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 83: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 84: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 85: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Relation Generation (22)

Conditions Relations Generatednsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 86: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 87: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Attribute Generation (22)

Conditions Attributes Generatedeat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 88: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 89: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 90: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results Metricsbull tgen = relationgen (UW1gen UW2gen)bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 91: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 92: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 93: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 94: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 95: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 96: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results Relation-wise F-score

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 97: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results Attribute

0

005

01

015

02

025

03

035

04

045

05

Scenario1 Scenario2 Scenario3 Scenario4

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 98: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

0

5

10

15

20

25

30

35

40

Scenario1 Scenario2 Scenario3 Scenario4

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 99: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

ERROR ANALYSIS

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 100: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Wrong Relations Generated

0

5000

10000

15000

20000

25000

30000

35000

obj mod aoj man and agt plc qua pur gol tim or overall

False positive

False positive

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 101: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Missed Relations

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

obj mod aoj man and agt plc qua pur gol tim or overall

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 102: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 103: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football

club James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 104: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation

possibilityThe former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 105: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 106: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 107: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Step-through the Deconverter Module Output

UNL Expression

obj(contact(hellip

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान इस आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Linearization इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam01

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 108: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Lexeme Selection

[सपक]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पहचान का यि त]contact(iclgtrepresentative)ldquo(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 109: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आपthis यह

farmer कसान

agtobj contact सपक

pur

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 110: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Morphology Generation Nouns

Suffix Attribute values uoM NNUMploblique U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique oM NNANOTCHFploblique

The boy saw meलड़क न मझ दखा

Boys saw meलड़क न मझ दखा

The King saw meराजा न मझ दखा

Kings saw meराजाऒ न मझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 111: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 112: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

After Morphology Generation

you आपthis यह

farmer कसान

agtobj contact सपक

pur

you आपthis इस

farmer कसान

agtobj contact सपक क िजए

pur

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 113: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Function Word Insertionसपक क िजए कसान यह आप तालक मचर खटाव सपक क िजए कसान को इसक लए आप या तालक क मचर खटाव

you आप

this इसकलए

farmer कसान को

agtobj contact सपक क िजए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic को

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 114: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Linearization

इस आप मचर This you manchar region

खटाव तालक कसान khatav taluka farmers

सपक contact

you

this

farmer

agtobj

pur

plc

Contact सपक क िजए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 115: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agtobj

contact

pur

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 116: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

agt aoj obj ins Agt L L L Aoj L L Obj R Ins

you

this

farmer

agtobj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragtAfter_set =

obj pur

agt

Topo Sort each group pur agt objthis you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 117: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

talukafarmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 118: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

सपक कसान यह आप तालक मचर खटाव contact farmer this you region taluka manchar khatav

Case Identification

सपक कसान यह आप तालक मचर खटाव

contact farmer this you region taluka manchar khatav

Morphology Generation

सपक क िजए कसान यह आप contact imperative farmerpl this you region तालक मचर खटाव taluka manchar Khatav

Function Word Insertion

सपक क िजए कसान को इसक लए आप contact farmers this for you region या तालक क मचर खटाव or taluka of Manchar Khatav

Syntax Linearization

इसक लए आप मचर या खटाव This for you manchar region or khatav तालक क कसान को सपक क िजए | taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 119: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 120: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible

Adequacy How much meaning of the reference sentence isconveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 121: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Sample Translations

Hindi Output Fluency Adequacy

कवक क कारण आम क नाजक पि तया य द झलस रह ह 05 तशत का बोड म ण 10 ल टर पानी क साथ तो छड़का जाना चािहए

2 3

पर ण क अन प नय मत प स फल क अ छ व क लए खाद क खराक द जानी चा हए

3 4

हम मथी का फसल क बाद बोता ह मथी और ध नया फसल को अ बढ़न या अ बढ़न बता

1 1

जीवाि वक स मण स इसक जड़ भा वत होती ह 4 4

इम का प ी रटाइट का प रवार को सब धत होता ह और थोड़ा यह शतरमग क साथ समान दखती ह

2 3

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 122: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Results BLEU Fluency Adequacy

Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Num

ber o

f se

nten

ces

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation using Fluency alone

Caution Domain diversity Speaker diversity

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124

Page 123: CS712: Interlingua Based MT - IIT Bombay Interlingua Based MT ... (Marathi-Hindi-English: case marking and postpositions do not transfer ... English version (traditional) 1

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

124